Kolmogorov, Solomonoff, AIXI etc. questions

Many people seem to have questions about Kolmogorov complexity, Solomonoff induction, algorithmic probability theory, AIXI, the universal intelligence measure and so on.  I don’t always have time to watch all the email lists where these things get discussed, but if you do have any questions, concerns, etc. that you’d like to put to me, feel free to post a question below and I’ll try to answer it.

This entry was posted in Uncategorized and tagged , , . Bookmark the permalink.

47 Responses to Kolmogorov, Solomonoff, AIXI etc. questions

  1. R. Landau says:

    Just one question. What would be the lists where these things get discussed? I’d love to subscribe to some.

    Thanks.

  2. Shane Legg says:

    R. Landau:

    Lots of different blogs from time to time and various email lists. The main lists that I know of are:

    http://www.sl4.org/

    http://www.mail-archive.com/agi@v2.listbox.com/info.html

    http://www.mail-archive.com/singularity@v2.listbox.com/info.html

    Though I’ll warn you that they can sometimes be high volume, and in my opinion many of the posts are nonsense! … but I guess that’s pretty standard for public discussion lists 🙂

    Marcus Hutter also has an email list:
    http://www.hutter1.net/ait.htm#maillist

    There are many of the top people in the field of Kolmogorov complexity on the list. However the traffic is very light (months between posts)

  3. Joe Harris says:

    Just one… Eliezer (overcomingbias.com) totally dismisses the concept of emergence in discussions of AI.

    Do you agree? If you agree, do you think current directions will lead to a computable theory of consciousness?

    Joe

  4. Joe Harris says:

    And another…

    Do you know of a open source SQL compliant database that has been ported to run on a graphics card, preferably with massive parallelism?

    Joe

  5. Shane Legg says:

    Joe: Not quite on topic… but anyway.

    I tend to avoid the word emergence myself. This is because it is often used to mean some kind of magic, as in: we have a system with some cool stuff -> we put all the bits together -> emergence happens -> something truly amazing results!

    In other words, it tends to be used as a word to avoid explaining things. It’s a “miracle occurs” step. This makes it an overused and dangerous word. When I hear it my first reaction is to wonder if somebody is talking a load of rubbish.

    That said, it’s clear that surprising and powerful things can happen when large numbers of simple things interact with each other. In this sense I’d say that evolution is an emergent property of some systems.

    So in short: I think emergence exists and is important. However, unless somebody presents a decent proof that some kind of useful emergence will occur in their system (thorough simulation results or good mathematical proofs) then they are probably just trying to bull shit you. Heuristic arguments aren’t good enough.

    Re: consciousness. I have no idea.

  6. Shane Legg says:

    Joe: Even more off topic…

    GPUs provide massive floating point performance… which isn’t something SQL databases need much of. So such a port doesn’t make much sense to me.

  7. sp says:

    Thank you for offering this opportunity, I have some questions.

    1) One can often hear that Solomonoff universal prior formalizes Occam and Epicurus’ principles. Qualitatively this makes sense: Solomonoff prior does incorporate all possible explanations (Epicurus) and ranks them by their complexity/simplicity (Occam). But can one use a more “quantitative” argument when introducing Solomonoff prior?

    2) A related question: how unique is a definition of the universal prior? For example, could one use sum over 2^{-f(l(p))} when defining the prior probability of a string x, with f(l) being some function, other than f(l)=l? Here l(p) is the length of a program p that outputs x. If not, what does fix a particular choice of f?

    Thanks!

  8. Shane Legg says:

    sp:

    1) I’m not sure what would constitute a more quantitative answer. Perhaps you mean something like: if F(p) is a measure of how “good” a prior p is in some sense, and if we find the best prior according to F then we get the Solomonoff prior?

    2) You can’t use any function f here. For example, if f(x) = 0.5*x then this would double all the probabilities of different strings and potentially make the total probability of all strings add up to more than 1… which is not allowed.

    One thing you can do is to change the reference machine that you’re measuring complexity with respect to. This will move the Solomonoff prior probabilities around (in a way that is bounded by an independent multiplicative constant) . All the results from the theory still hold, e.g. Solomonoff induction still converges fast for “simple” sequences, however now you’ve taken a different view on what is simple vs. what is complex (to an extent). This issue is usually deal with by restricting the complexity of the reference machine, for example, by limiting its state-symbol complexity. In my opinion this fix seems to be a bit ad-hoc.

    So in short, the uniqueness of the universal prior is fairly good, but not great. A number of attempts to come up with a more elegant solution to this have failed. Personally, I suspect that we should take some very basic principle from physics, such as locality and the dimensions of space, and somehow add this as a constraint on the reference machines.

  9. sp says:

    Shane:

    1) Yes, you understood my question correctly. Is there such a way of showing that the Solomonoff prior is the best? I understand that there are bounds in terms of the Kolmogorov complexity, but can one give an argument why this (using the complexity) is good, without invoking Occam?

    2) If I could convince myself that changing the reference machine is the only way to modify the prior, I would be completely happy. Transforming f (x) -> f(x)+c looks quite acceptable to me. But I still do not see that f(x)=x (up to a constant) is the only choice. First, I could impose normalization by hand (so the probability would add to one), second, I could choose a non-monotonic function and perhaps avoid such normalization. I agree that linear function appears to be most natural and convenient. But again, is there a more rigorous argument to rule out other options?

    Thanks!

  10. Shane Legg says:

    sp:

    1) The problem becomes: how do we define “good” for a prior, i.e., what is our F function?

    Solomonoff’s prior is a kind of mixture distribution. In many cases any mixture with non-zero weights will do. See for example Marcus’ AIXI book where much of the work is done with arbitrary Bayes mixtures. Thus, much of the power of Solomonoff’s prior doesn’t derive from Occam’s razor. Though you have to keep in mind that a proper prior over an infinite space still must assign low probabilities to sufficiently complex things in order to keep the sum finite. Thus, a prior can only deviate from Occam’s razor up to a point anyway.

    Nevertheless, there are good reasons to go with Solomonoff’s prior. One important property is its relatively high degree of invariance under reparametrisations and other transformations. There is a short discussion in my thesis about this (which will be out in a few weeks) and a more detailed discussion in:

    http://arxiv.org/abs/0709.1516

    Another argument is the Solomonoff bound itself. If you think about it, a general purpose predictor that doesn’t know the true distribution \mu must make at the very least an expected number of errors on the order of K(\mu). In other words, in general it can’t learn what the true distribution is in fewer than the minimum possible number of bits needed to actually describe the distribution. Solomonoff induction achieves this lower bound on errors within a factor of ln2/2. I don’t know off the top of my head, but I suspect that this extra multiplicative factor also might be unavoidable.

    I’ll reply to 2) in another comment…

  11. Shane Legg says:

    sp:

    2) Something like f(x) -> f(x)+c for integer c>=0 can easily be done with a change in reference machine: just make the reference machine ignore the first c bits of input. Similarly for f(x) -> 2x etc. Indeed, any f that is equivalent to some computable permutation of the input programs can be achieved by a suitable change in the reference machine. Essentially you just construct a new reference machine that runs a decoding algorithm on the input program before passing the result to the original reference machine.

  12. sp says:

    Shane:

    > Solomonoff’s prior is a kind of mixture distribution. In many cases any mixture with non-zero weights will do. See for example Marcus’ AIXI book where much of the work is done with arbitrary Bayes mixtures.

    Well, I think the Solomonoff prior is equivalent to a mixture over all (semi)computable (semi)measures. For a Bayes mixture one can express convergence bounds in terms of the mixture weights, I agree.

    > Thus, much of the power of Solomonoff’s prior doesn’t derive from Occam’s razor.

    But I thought that a practical value of the Solomonoff prior is that it gives a prescription for how to set the weights, which in a sense comes from the Occam’s razor. So my questions was how unique the choice of weights is. Perhaps I am repeating myself.

    > Though you have to keep in mind that a proper prior over an infinite space still must assign low probabilities to sufficiently complex things in order to keep the sum finite. Thus, a prior can only deviate from Occam’s razor up to a point anyway.

    Does this “up to a point” allow, let’s say, f(x)=x^2?

    > There is a short discussion in my thesis about this (which will be out in a few weeks) and a more detailed discussion in:

    I will take a look.

    > Another argument is the Solomonoff bound itself. If you think about it, a general purpose predictor that doesn’t know the true distribution \mu must make at the very least an expected number of errors on the order of K(\mu).

    Again, why not K^2(\mu)+c, which would correspond to f(x)=x^2+c.

    Personally, I do strongly feel that the universal prior how it is defined, is about the only reasonable way (up to O(1) notations) to do it. One way I convince myself is to consider two systems (think about two distributions) A and B, that are completely independent, have nothing in common, in the sense that K(A|B)=K(A) (likewise K(B|A)=K(B), as usual upto O(1)). Then the two can be considered together and separately, and a meaningful universal prior P for these systems should have a property P_{A+B}=P_A P_B. From where f(x)=x+O(1) would follow.

    But this is only a plausibility argument (which is fine with me). But then I can imaging that a sceptical opponent could ask the questions I am asking here, and I would not know how to answer rigorously. My hope is that somebody can do it.

    > Something like f(x) -> f(x)+c for integer c>=0 can easily be done with a change in reference machine: just make the reference machine ignore the first c bits of input. Similarly for f(x) -> 2x etc

    I thought that f(x) -> x+c is a consequence of the invariance (with respect to a reparametrization) theorem, which holds for partial recursive functions. This would not include f(x) -> 2x. Am I missing something?

    Thanks!

  13. Shane Legg says:

    sp:

    > I think the Solomonoff prior is equivalent to a mixture over all (semi)computable (semi)measures.

    I’m being a bit vague about what I’m calling Solomonoff’s prior here. In his original technical report he defined a prior by putting a stream of uniformly random bits into a UTM. This is the one people often associate with his name. However, he also mentions taking a mixture distribution approach. In any case, yes, they are in some senses equivalent. In my thesis I call the latter the Solomonoff-Levin prior to distinguish the two.

    > But I thought that a practical value of the Solomonoff prior is that it gives a prescription for how to set the weights, which in a sense comes from the Occam’s razor.

    I guess the Solomonoff-Levin prior basically consists of 3 things:
    1) Take a huge space such as all the enumerable semi-measures.
    2) Take a mixture over this space to define a prior.
    3) Weight this mixture according to Kolmogorov complexity, i.e. Occam’s razor.

    As far as I know, nobody before Solomonoff was taking mixtures over very large spaces such as the space of all enumerable semi-measures. So even if you don’t use a Kolmogorov weighting, you still should give some credit to Solomonoff for having this idea that universal AI could be approached by taking these huge spaces that contain all effective hypotheses over arbitrary strings and then defining a mixture prior and approaching the problem probabilistically.

    > Does this “up to a point” allow, let’s say, f(x)=x^2?

    Yes, you could define a weighting function for a mixture in this way, and it would still be a semi-measure, obviously. (I’m talking about arbitrary weighting function here). My point here is that you can’t assign significant prior probability to a non-trivial proportion of complex objects because otherwise they will contribution too much to the total probability mass and push it over 1.

    > Again, why not K^2(\mu)+c, which would correspond to f(x)=x^2+c.

    What does 2(\mu) mean?

    > I thought that f(x) -> x+c is a consequence of the invariance…

    Yeah, things are getting a bit confused here! Sorry about that. At some point, any stretching introduced to the code lengths will be compensated for by the reference machine itself (in the Solomonoff-Levin prior, in the plain Solomonoff prior it’s harder to see but effectively still there). So yes, for sufficiently large x you recover the +c bound for an independent c.

    Let me go back to your original question as it seems to be the clearest statement of your objection:

    > could one use sum over 2^{-f(l(p))} when defining the prior probability of a string x, with f(l) being some function, other than f(l)=l? Here l(p) is the length of a program p that outputs x. If not, what does fix a particular choice of f?

    If f reduces the code lengths then we start to run into problems as we no longer get a semi-measure. The extent to which a general code length reduction is possible will already be taken into account by the normalisation (in this context it makes sense to talk about a proper probability distribution).

    On the other hand, if f lengthens the codes (and this isn’t renormalised out) then you get lower prior probabilities and worse performance. If you normalise to compensate for this, then you’re back to where you started.

    What we desire then is to have the shortest code lengths possible. Taking all p that produce x into account (which will include the shortest p) gives you this. Any remaining inefficiency in the set of self-delimiting valid programs will be normalised out.

    In other words, you want to maximise the prior probabilities, which is to say that you want to minimise the code lengths… which you’re already doing by construction. Messing around with f can’t improve on this.

  14. Denis says:

    Hi Shane,
    first of all congratulation for SIAI prize .
    I am pleasure to know your opinion about what I think be an interesting fact about Kolmogorov complexity , I try to explain my think.
    Im many lectures I made related to Kolmogorov complexity I found very often the usage to approximate the K(x) with log(x) .
    From a mathematical point of view this is a good approximation becouse only for few value of x we have K(x)<log(x) but I think this can be a bad approximation in the real world.
    Roughly speaking what I mean is that if I have a string with lenght N all possible string are mathematically 2^N but in the real world is wrong , if I have n=10^6 is not possible to think that we have 2^(10^6) possible strings (“Only math nerds would call 2^500 finite” – Lenoid Levin ).
    It is clear that if we don’t know the value of a string s with size n this value is one configuration over 2^n but the available strings is a subset of all possible strings.
    From a probabilistic point of view writing 1/(2^n) is mathematically correct but we can reduce the set to the real available strings.
    From an empirical point of view is interesting to see some explanation to strange fenomena the best of them I think is the compression paradox.
    Mathematically the probability to compress something is exponential low instead all pc user know that compress something is high probable . Not only but the experience in using compression say that increasing the amount of data to compress the compression ratio increase ! in absolutely opposition to the probabilistic laws. Yes is possible to give a lot of explanations ( the data we use is bad … ) but I don’t like these .
    I try to make a different math formula for the distribution of string of lenght N .
    Instead of 2^N I try to use Min(2^N,M/N) to see what happen .
    In this formula I insert a limitation M that is the power of the system so watching on the distribution of this formula the number of available strings are exponential for small N but ( small file are incompressible ) but decrease for big N ( big files are more compressible ) .
    I think this empirical discrepance from the mathematical laws are not restricted only to the compression but is very diffuse .
    Another field is all the randomized algorithm . How can all the “randomized” algorithm applied to solve exponential problems work emprirically better than deterministic algorithm ?
    A special often “randomized” algorithm is the genetic programming that empirically work fine in opposition to the NFL theorem , by this theorem is very improbable to have a working algorithm given a problem over all possible problems but it work fine if you give a problem from the real .
    My opinion about this is that the “randomized” algorithm make some jumps over a mathematical distribution so can examine big part of real distribution without these jumps is not possible to examine a so big real distribution.
    For these reason I don’t like to approximate K(x) as log(x) becouse I think the subset of 2^N in the real world is made by a subset of element where K(x) is less than log(x) .
    Let me make another step.
    We have a mathematical space 2^N and a real space of Min(2^N,M/N) ( I don’t know if the second part M/N is correct but this is not the point ) what I think is that you can not cover all the real space also ( and this is the main reason) if this space is very small compare to the mathematical becouse also if this space is computable by some function and program , such program can be irreducible respect to some complexity and is not practical feasible to execute such program .
    “Randomized” algorithm let you to make these jump , impossible to do algorithmically (by speed and/or space limitations ) . With “randomized” I mean arbitrary real data and not random in the algorithmically definition .
    Following my think the data used by the learning algorithm have a fundamental role to speed up the algorithm , used to reach such states impossible to reach without these jumps.
    So the extradata used by the learning algorithm is used also without specializing the algorithm for a specified field but in the general case.
    My think is very closed to Jurgen Schmidhuber in “The Speed Prior” but with different objective
    I focus my attention to such cases where computation reach its limits but I think this happen often.
    I think can be interesting to test empirically what happen changing some formula for example the classical universal distribution to a limited universal distribution defined by Sum(Max(2^-L(Pi),L(Pi)/M)) where Pi is the program output x .
    I hope I captured your attention.

    Denis.

  15. sp says:

    Shane:

    I looked at the Hutter’s paper that you recommended, it was quite useful, it also helped me to understand what you were saying. Basically, he explains (a paragraph after Eq.13) in what sense the universal prior is unique. Endeed only f(x)=x+c would make sense. I did not fully appreciate dominance of any simple f(x) over K(x) in the context of prefix codes. (a paragraph before Eq.11). You were right that lengthening the description does not help, while shortening (like taking f(x)=x/2) would not be possible, but only for simple f. This needs to be stated explicitely, otherwise one cannot claim uniqueness of the prior.

    That was helpful, thanks. I guess the only remaining question would be why to require K(f)=O(1)? I guess the answer would be the Occam principle. But that feels acceptable.

    >> Again, why not K^2(\mu)+c, which would correspond to f(x)=x^2+c.

    >What does 2(\mu) mean?

    LaTeX notations, K(\mu) squared. Irrelevant now.

    Thanks!

  16. Shane Legg says:

    sp:

    > I guess the only remaining question would be why to require K(f)=O(1)?

    It means the function has a finite code length. In other words, f is computable.

  17. Shane Legg says:

    Denis:

    You have quite a long post, however I seem to disagree with some of your premises. So I’ll start with these:

    > In many lectures I made related to Kolmogorov complexity I found very often the usage to approximate the K(x) with log(x). From a mathematical point of view this is a good approximation because only for few value of x we have K(x) Roughly speaking what I mean is that if I have a string with length N all possible string are mathematically 2^N but in the real world is wrong, if I have n=10^6 is not possible to think that we have 2^(10^6) possible strings.

    I disagree with this. I can’t write down all of the 2^10^6 strings. However, I could easily write down any one of them. It would take a few hours, sure, but I could do it. As such, I don’t consider any of these strings to be “impossible” in some sense.

    > It is clear that if we don’t know the value of a string s with size n this value is one configuration over 2^n but the available strings is a subset of all possible strings. From a probabilistic point of view writing 1/(2^n) is mathematically correct but we can reduce the set to the real available strings.

    Again I disagree with this for the reason above.

  18. sp says:

    Shane:

    > It means the function has a finite code length. In other words, f is computable.

    I see. So the point really is: whichever f(x) I choose, for complex enought models I will have the same convergence bound of the form ln(2)K(\mu)+c. And the notion of uniqueness is only meaningful in such an asymptotic sense.

    As for the weights for simple models, of which we have a finite number, we can choose those weights at our liking (as long as they sum to less than 1). I have to admit that, from a practical point of view, this does not seem terribly useful 🙁

  19. Shane Legg says:

    sp:

    I’m no longer sure what you mean by f in this context. The bound is a bound. It holds for all models.

    Regarding uniqueness, you’ll need to be more explicit about what you mean by this. In terms of definition, once you’ve picked your reference machine the prior is uniquely defined. There is no longer any +c etc, and the bounds hold exactly.

    If we decide on a reference machine, then “simple” models (meaning simple with respect to some other simple reference machine) cannot have arbitrarily high complexity to us (i.e. with respect to the reference machine we are using) due to the invariance theorem. So we can’t assign weights in any way we want.

    From a strictly practical point of view none of this is computable. Solomonoff induction is very much in the land of theory: mathematically simple model with amazing generality and rate of convergence, but not realisable.

  20. sp says:

    Shane:

    > I’m no longer sure what you mean by f in this context.

    At first f was a function of the program length, then it was a notation from the Hutter’s paper (page 9). Either works for discussing arbitrariness (to some extent) of weights for a finite number of models.

    > Regarding uniqueness, you’ll need to be more explicit about what you mean by this.

    See the paragraph before Eq.11. I would like to understand how much f(x) can deviate from K(x). You have suggested that any f(x) (computable f, satisfying Kraft’s inequality) will do.

    > In terms of definition, once you’ve picked your reference machine the prior is uniquely defined. There is no longer any +c etc, and the bounds hold exactly.

    I may very well be wrong here, but it seems to me that the difference between switching the reference machine and leveraging f is in the size of the constant in the convergence bound. I mean the following: in case of f, I could make c in the convergence bound be c>>K(f) (say, K(f)=10^3, and c=10^{10^3}) . While when changing the reference machine, the constant c cannot be much larger than the description length of the new machine.

    > If we decide on a reference machine, then “simple” models (meaning simple with respect to some other simple reference machine) cannot have arbitrarily high complexity to us (i.e. with respect to the reference machine we are using) due to the invariance theorem. So we can’t assign weights in any way we want.

    Is it possible to make the weights of simple models to be 2^{-10^{10^3}} using f of complexity O(1)? The meaning of notations f,O is the same as in Hutter’s paper. From what you have said I concluded that this will still fit the notion of the universal prior.

    > The bound is a bound. It holds for all models.

    Yes of course. I just ment that the linear dependence on K in the bound becomes apparent for complex models. For simple models it is shadowed by c. So I was using “complex enough” for K>>c.

    Summarizing: my question is about the size of c in the convergence bound. Can c be made very large (much larger than O(1)), by using f of K(f)= O(1)? Or must the size of c be comparable to K(f)?

  21. Shane Legg says:

    sp:

    I think the best answer to your question might be to show you how the Solomonoff convergence result works 🙂 Not all the details, just the key steps. Have a look at a couple of pages from my PhD thesis:

    http://www.vetta.org/documents/SolIndPhDThesis.pdf

    I left the first page in as intro, but it’s the second page that we really care about. Above the second full equation we have the dominance property:

    ∀x ∈ B^∗ : ξ(x) ≥ 2^{−K(μ)} μ(x)

    Then looking at the equation below, you can see how this property allows us to arrive at the Solomonoff bound. So, now that you can see this, you should be able to see how messing around with the weights in the mixture ξ affects the resulting bound. You can still get some kind of a bound (assuming your weights don’t actually break the mixture, e.g. make it sum to more than 1).

  22. Benjamin says:

    Shane,

    how do you think about set theory as the/a basis for mathematics? When we abstract from action, perception as sets we are committed to atomistic action & perception, as well as the seperation of perception/action from “hardware”. Isn’t there a need for hierarchies? I think the problem of AI is very similar to understanding music. Music consists of pieces, pieces of movemements, movements of themes, themes of notes, notes of waves. However there is no ultimate reality: music doesn’t consist of waves, but rather every level is essential, and there are several dimensions (harmony, rhythm, melody, orchestration, performance, …). The professor under whom I have studied has solved the problem by using category theory.

  23. sp says:

    Shane:

    Thanks, I checked the link, I was slightly familiar with the derivation of the Solomonoff bound. Maybe I should reformulate my question.

    I knew that the Solomonoff prior has some nice properties: 1) it only weakly depends on choice of a reference machine, 2) its convergence is bounded by the Kolmogorov complexity of a distribution from which strings are sampled.

    If just these were the claims about the Solomonoff prior, I would not be wasting your time. But something else is often said, namely that the Solomonoff prior solves the induction problem in an optimal way. Similar statements are made about the Levin search and AIXI. I think that whenever someone talks about optimality, one should be able to give an unambigous criterium of what is optimized (which means that there is a quantity which is minimized).

    My question is: what is this criterium?

    Thanks for your patience.

  24. Shane Legg says:

    sp:

    Good question. I’m not sure if there is a really good answer. The problem is that “optimal” can mean various different things. Take sequence prediction, for example. In some sense the optimal predictor for a sequence sampled from some distribution μ is to use μ itself to predict. This will give you the best predictions in various senses. Of course, that’s cheating: you’re “solving” a learning problem by simply putting the solution in to start with.

    Ok, so if a system actually has to learn what the generating distribution is through observation, then at the very least the bound on the number of errors that the predictor makes is going to contain enough information to describe what μ is. It’s the best we could hope for in any system that’s not cheating. Solomonoff induction essentially achieves this bound, and so in this sense is “optimal”.

    Although this shows that Solomonoff induction is “optimal” in terms of maximally exploiting all structure in the observed sequence in order to make predictions, you could define “optimal” in some other sense. For example, you might want a predictor that makes the best possible predictions for a given observed sequence AND available computational resources. As Solomonoff induction is uncomputable it’s certainly not optimal in this second sense.

    So in short, “optimal” can mean various things as it all depends on what measure of performance you are trying to be optimal with respect to. Thus, in each case you need to look at the technical results to understand exactly what “optimal” means.

  25. Shane Legg says:

    Benjamin:

    People have tried to found mathematics on various things, including various kinds of set theory, basic arithmetic and logic. Personally, I just see mathematics as a collection of relations. Thus, there is no true foundation: you could construct a foundation out of any simple system of relations that was consistent and sufficiently rich.

    I don’t understand what your question is in the second half of your comment.

  26. sp says:

    Shane:

    Thanks, at least I know that an answer to the question, that is bothering me, is not obvious. I, however would stick to the idea that “optimal” should be well defined, given a criterium. There can be different criteria, and correspondingly different optimal solutions – that is fine. Given a criterium, there can be multiple solutions, which are equally good – that is fine. But I perfer to think that “optimal solution” implies a well defined criterium. Your example of an optimal measure for predicting sequences is totally legitimate, though very simple.

    When you say that the Solomonoff bound is optimal, you probably mean what is called Pareto optimality: that it cannot be improved for any given model without degrading the bound for some other models. Such optimality also sounds fine to me.

    My only problem is, again, that I could do reweighting, which would result in a prior that is as good as the Solomonoff prior, from the Pareto optimality perspective. Here we are back to square one 🙁

    Anyway, it was good to confirm that I am not missing something way too obvious. So it seems.

    Thanks!

  27. sp says:

    I thougt a bit more about a criterium for a universal prior. It seems to me it is sufficient to demand that: 1) the universal prior should exhibit an optimal convergence bound (as you explained), 2) and it should maximally enforce the Occam razor in the sense that it should assign as much weight as possible to the simple models, while respecting the condition 1. This, I think, leads unambiguously to the Solomonoff prior (upto a choice of a reference machine).

    I think it is important not only to explain “how”, but also “why”. So I feel much better now, I feel I can now explain “why”.

    Thanks a lot Shane!

  28. Shane Legg says:

    sp:

    > I, however would stick to the idea that “optimal” should be well defined, given a criterium.

    In informal discussion it’s typically not well defined. However, in the original papers, which is where you should look for any non-trivial questions, these things are very well defined — they are stated as precise mathematical theorems. So, for example, if you want to understand in what senses AIXI is optimal (actually AIξ to be more precise and use Hutter’s notation… AIXI is slightly different) you should see chapter 5 of Hutter’s book, in particular Theorems 5.23, 5.24, 5.29, 5.32, 5.34 and 5.36. These prove various things, including Pareto optimality and the much stronger condition of balanced Pareto optimality.

    > My only problem is, again, that I could do reweighting, which would result in a prior that is as good as the Solomonoff prior, from the Pareto optimality perspective. Here we are back to square one

    Perhaps one thing to keep in mind: Solomonoff doesn’t prove Occam’s razor. He takes it as a foundational principle and formalises it. You also can’t prove Occam’s razor from Pareto optimality alone, as evidenced by the fact that you can still rebalance. In other words, if you reject Occam’s razor, nothing in Solomonoff induction is going to convince you to change your mind.

  29. sp says:

    Shane:

    > if you reject Occam’s razor, nothing in Solomonoff induction is going to convince you to change your mind.

    I never meant to suggest that the Solomonoff induction can be understood without Occam’s razor. And I think we agree that the universal prior is made unique (without resorting to an expilicite definition of Solomonoff) by: choice of a reference machine, Occam’s razor, Pareto optimality.

  30. ben says:

    Thanks for your comment. I agree that there is always a mapping from one category,
    i.e. in my interpretation a sufficient basis, to another. However there is an inherent weakness with sets. A set is a collection with a name, but the members are anonymous. One can of course easily construct a tree, by adding relations (vertices in this case). When using sets, most problems appear to be that of right selection (all options are well-defined and known apriori). However in my opinion finding a solution is more about finding the right representation. I think of it as encryption: if somebody gives you the key (the solution) the runtime is a small constant. So finding the formula e=m*c^2 is easy once you know it. I don’t know how this coincides with your approach.

  31. Shane Legg says:

    ben:

    I don’t understand what you mean when you say that the elements of a set are anonymous. You can represent integers, real numbers, etc. all using nothing but sets. You also seem to imply that trees are more powerful than sets. However, it’s easy to represent any tree as a set — you simply nest sets inside sets. For example, you could represent a simple tree with a root arity of 2 and a left subtree arity of 2 as the set { {{},{}}, {} }.

  32. ben says:

    In general there is no class/instance operator. Set A equals set B if all elements are equal. Further the logical contruction is not nested or rather ‘not intented to be nested’. An alternative is to allow hierarchal constructions including the powerset operator (called product) , but also conjunction (called limit) and disjunction (called colimit). In object orientated programming this is well known as a class with other classes as members, a list as a member, and so on.

    To give an example, a note is a conjunction of onset (IR), pitch (IN), duration (IR), loudness (IR), voice (IN). It is not a set, because every member is needed.

    It’s not a question, but rather a concern with AI, especially it’s use of utility theory, which I am concerned about. I wonder whether a complexity approach is a solution for ‘my’ problem so I am looking forward to read your PhD thesis.

  33. Shane Legg says:

    ben:

    I still don’t understand what you’re talking about.

    > In general there is no class/instance operator.

    Sure there is. A set can represent a class and an instance is just an element of this set.

    I also don’t see why a note is not a set. A note must have these for properties. Ok, just define the set of all notes to be the set of all things that have these four properties.

    The whole of math can be founded on set theory. If you can express something mathematically, you can express it with sets.

  34. ben says:

    I don’t know in which way the following is true, but at least according to wikipedia “certain categories called topoi (singular topos) can even serve as an alternative to axiomatic set theory as a foundation of mathematics. These foundational applications of category theory have been worked out in fair detail as a basis for, and justification of, constructive mathematics.” I have not been able to find out whether this can be proven or what the argument exactly is. What I gave above was an informal, incomplete definition of a topos.

  35. ben says:

    Sorry if this seems as a thread hijack, but this is my issue with the universal intelligence measure.

  36. Shane Legg says:

    ben:

    > certain categories called topoi (singular topos) can even serve as an alternative to axiomatic set theory as a foundation of mathematics.

    Sure. People have built foundations for mathematics on all sorts of things. Many of these, if they are sufficiently rich, end up producing classical mathematics. If you choose to limit the power of the axioms, for example intuitionistic logic, then you can get some version of constructive mathematics.

    Thus, I’m not at all surprised to hear that you can found mathematics on category theory, and I’m sure that if you make the axioms strong enough you’ll end up with classical mathematics — just like you get from some versions of set theory. (Actually I should say “some version of classical mathematics”, for example, do you or do you now include the axiom of choice (or equivalent statements in alternate axiomatic foundations)? Probably yes, but some argue otherwise.)

  37. Benjamin says:

    I don’t know about my construction argument, however Grothendieck’s Axiom provides “the ability to do category theory, which requires huge sets (inaccessible cardinals) larger than those postulated by the ZFC axioms. Grothendieck’s Axiom postulates the existence of such sets.” http://us.metamath.org/mpegif/mmset.html
    http://us.metamath.org/mpegif/ax-groth.html

    “Doing mathematics in a categorical framework is almost always radically different from doing it in a set-theoretical framework.” But also: “Still, it remains to be seen whether category theory should be “on the same plane,” so to speak, as set theory, whether it should be taken as a serious alternative to set theory as a foundation for mathematics, or whether it is foundational in a different sense altogether.”
    http://plato.stanford.edu/entries/category-theory/

  38. Denis says:

    > disagree with this. I can’t write down all of the 2^10^6 strings. However, I could easily write down any one of them. It would take a few hours, sure, but I could do it. As such, I don’t consider any of these strings to be “impossible” in some sense.

    This is my main claim , I think it is impossible to write down any one of them .
    How can you write down any one of them ?
    There is not a resource bounded program to do this.
    There is not a resource bounded program able to write down any one of that strings.
    Every existing (not every possible programs ) can cover only a part of that set of strings.
    Ok this is an extreme deterministic (I am deterministic ) point of view but I think it is possible to have the same result in a probabilistic domain.

  39. Shane Legg says:

    Denis:

    >How can you write down any one of them ?

    Of course I can write one of them down. It’s a string of 1’s and 0’s that is 10^6 = 1 million digits long. I just did a count and my thesis is just under half a million characters long. In ASCII binary that’s a string of 4 x 10^6 bits, and I typed it all in myself.

    Ok, but let’s say that you just didn’t pick a big enough number. What if we considered the set of binary strings that are 10^1000 bits long, of which there are 2^10^1000 strings?

    A string this long I can’t write down by hand, or with a machine. It’s too big. Ok, so why does this matter to somebody building an AI? For example, under a universal prior distribution the prior probability assigned to a randomly chosen string of this length is going to be on the order of 2^-(10^1000) which is extremely extremely close to zero. Technically they are not treated at impossible, but in any practical sense they are treated as though they are essentially “impossible”. In which case, how does your idea of considering these strings to be “impossible” make much difference?

  40. Denis says:

    >Of course I can write one of them down. It’s a string of 1’s and 0’s that is 10^6 = 1 million digits long. I just did a count and my thesis is just under half a million characters long. In ASCII binary that’s a string of 4 x 10^6 bits, and I typed it all in myself.

    Yes you can do it , you can write down a subset of all available strings ( and this the maxmum available ) .

    >Ok, but let’s say that you just didn’t pick a big enough number. What if we considered the set of binary strings that are 10^1000 bits long, of which there are 2^10^1000 strings?

    I think this is about the same . I think that a string long 10^1000 does not exist.

    I explained something in my first post.
    I think the world is not exponential so we can not think to resource functions like O(2^N) .
    If I think to a system with limited power I can set a limit for example to the size of the strings .
    If I place this limit M a starting consequence is that the space distribution is not 2^N but Min(2^N,M/N) .
    From here we can change the universal distribution as I suggest for an interesting empirical test and others formula.
    I show other consequences in the first post…
    In a strictly practical point of view what can we do knowing this ? This is the most important question.
    I think that this consideration emphasizes the role of existing data .
    The existing real data let us to make such “jump” to “good” machine states impossible to reach with computations

  41. James Bowery says:

    Shane,

    Congratulations. I submitted a /. story on your prize but maybe I didn’t do a good enough job of explaining the value of your work to catch the attention of the reading public. “Machine Super Intelligence” is certainly catchy enough, one would think but we’ll see.

    My question is: What do you make of Solomonoff’s statement:

    “This subjectivity, the fact that they are based on choice of which Universal machine to use, is characteristic of all prediction systems based on a priori probability distributions. The choice of Universal machine and its instruction set is a necessary parameter in the system that enables us to insert a priori information into it. The dependence of the universal distribution on choice of machine is not a Bug in the System — it, too, is a Necessary Feature.”

  42. Shane Legg says:

    James:

    After going to so much effort to write the thesis I figured I may as well give it a push with a flashy title! I’m curious to see if it attracts much attention. My guess is that it will largely be ignored… until 10 or 15 years from now.

    Regarding reference machines. I take a different perspective. I think we should keep the reference UTM as simple and clean as possible. If we want to put prior knowledge into the inductive inference system, then we can just provide this as input prior to our training data. Any “prior knowledge” that goes into the design of the reference machine, should be only our most fundamental assumptions about the nature of the universe. Such as the Church-Turing thesis and Occam’s razor. Obviously we need to put enough in there to make induction possible, but we should not put in any more. From that point onwards we should “let the data speak for itself”, as the statisticians say.

    I’d also like to say that I think that the problem of choosing the reference machine is overplayed by many people. Remember that the bound in prediction errors in Solomonoff’s result is ln(2)/2 K(\mu). Thus, if we switch from one simple reference machine to another the bound only moves by a few hundred bits, maybe a few hundred bytes, due to the compiler constant in the Kolmogorov complexity function. Unless you have a tiny data set, that’s still a very powerful bound.

  43. Rick Schwall says:

    Dr. Legg,

    Congratulations on achieving the rank of Doctor of Philosophy! It’s good to know that the Philosophy of Machine Intelligence will get good health care.

    Also, congratulations on earning the SIAI academic prize.

    I am a contributor to and volunteer for an organization concerned about the future of mankind. I have agreed to their request to not say or imply that they are associated with me. I am between work assignments, and I would be willing to assist you in any way I can, from manual labor on up. I have a Ph.D. In chemistry. If you will send me an e-mail address, I’ll send you a resume. My e-mail address is r -~ -~ schwall -~ -~ verizon -~ -~ net (I wonder if the address-harvester bots have gotten smart enough to recognize that?). Even better, call me at 1 wait 805 pause 890 breathe 2514 whew! (USA, California).

    I’m NOT seeking paid employment. I only work for free these days.

    Just so this is not completely off topic, I do have an AI-complexity question. (You’re welcome to tell me this is a newbie question and send me off to read a textbook.)

    My ‘understanding’ is that Gödel’s Incompleteness Theorems depend on the integers being an infinite set…

    OK, I admit, I don’t understand this well enough to ask a question that is provably not stupid.

    So, in vague, imprecise English: is your new K-dot (sequence complexity) completely dependent on infinities? For example: the integers go on forever?

    Suppose we admit that we’ll never need the integers out past 10^(10^6)? Our Friendly AI’s low-level algorithms will return an error code if a million-digit register overflows. (And no, we won’t try to process your dissertation / book as a single string.) Does this cast new light on the fit of K-dot (and maybe even Gödel incompleteness) to physical reality?

    Suppose we admit that our Friendly AI will be using floating point arithmetic (gasp!) a lot of the time?

    Suppose we admit that any infinite set (such as the natural numbers) is a model, a construct in human language, that does not exist in physical reality? That has been a really convenient tool for a lot years, but maybe we’ve reached an area where something else will do better.

    The past introduction of rational numbers, real numbers, imaginary numbers, etc. Reminds me that our constructs have a limited match with reality. Newton’s laws, which are still used in most engineering, are known to be inaccurate, but are “close enough” except in certain domains (the very fast, the very tiny, etc.). Maybe the realm of limited-register computing is a new domain.

    I _am_ disappointed that we may not be able to mathematically prove that a given AI architecture and mission is safe and will stay safe. However, that doesn’t mean I’m going to just quit in despair. The AI crisis won’t go away just because we can’t map something onto a two valued, proof / disproof, world view. Get Bayesian.

    We still need to learn all we can about making safER AI. We’ll need approximation methods. We’ll need to be able to compare architectures and components and choose the less dangerous ones.

    Relinquishing all technologies that lead toward advanced AI just won’t happen. (Unless we crash our complexly-interwoven civilization.) If recursively self-improving AGI is possible, it is inevitable. HUMANS will make those decisions. (Go ahead, prove that humans are Friendly, safe or sane.)

    Besides, we NEED the help of trans-human, hyper-moral, not-evolved AI. I can say with ‘certainty’ (well past 5 nines, or 15 bits) that Natural Intelligence (AKA “politics as usual”) is NOT SAFE. Especially not when armed with huge nuclear arsenals that can probably make us extinct. Just for starters.

    And the upside potential of trans-human AI is unimaginable! (At least it is to my lowly level 126 intelligence.)

    What? Hmm? Oh, yes, you’re right, that is a soap-box I’m standing on. Heh.

    Give me a call. I’m REALLY committed to participating. I’ve been putting in 20 – 45 hours per week, but I may not be well-focussed on that which will make a difference. I need a research advisor.

    Thanks for your time and your listening.

    With respect and high regard,

    Rick Schwall, Ph. D.

    ———————————–
    When humanity has ENGINEERED (NOT evolved) the first super-human hyper-moral Artificial Generall Intelligence, it will be Humanity’s Triumph over Evolution.

  44. Shane Legg says:

    > I am a contributor to and volunteer for an organization concerned about the future of mankind. I have agreed to their request to not say or imply that they are associated with me.

    Haha! Is being associated with you really so bad?!

    > My e-mail address is r -~ -~ schwall -~ -~ verizon -~ -~ net (I wonder if the address-harvester bots have gotten smart enough to recognize that?).

    My email is shane@vetta.org I don’t care about the bots, they’re no match for gmail.

    > I’m NOT seeking paid employment. I only work for free these days.

    Hopefully someday I too will be able to work for free. For the time being peanuts are accepted.

    > So, in vague, imprecise English: is your new K-dot (sequence complexity) completely dependent on infinities? For example: the integers go on forever?

    Math in general depends on infinities: it’s a mess without them. Consider addition and subtraction. If x = y + z then x – y = z right?
    If you only have a finite number of integers then that no longer works in general as y + z might be larger than the largest integer permitted. When addition fails, so too does K-dot and most other things.

    In many simple applications the fact that computers can only approximate mathematical numbers this isn’t a problem as the numbers don’t get too extreme. But in many numerical applications it’s a serious issue and all sorts of work has to go into making computations behave themselves.

    > Suppose we admit that we’ll never need the integers out past 10^(10^6)?

    What do you mean by “need”? Do you mean that we can implement an AI with normal double precision floating point numbers? Or do you mean that we should consider 10^(10^6) + 1 to not exist?

    > Suppose we admit that any infinite set (such as the natural numbers) is a model, a construct in human language, that does not exist in physical reality? That has been a really convenient tool for a lot years, but maybe we’ve reached an area where something else will do better.

    I think the problem is that you’re confusing the model with reality. In physical reality the number 3 doesn’t exist. Have you ever seen a 3? You may have seen 3 bananas. You may have seen a symbol “3” that
    represents the concept. But you’ve never actually seen a 3. It’s an abstraction, not a real thing. The same goes for infinity.

    Many traditional cultures didn’t have the concept of negative numbers, fractions, zero, and in some cases they didn’t even have numbers beyond ten — after ten is was just “many”. This abstraction might work for counting your children, but it’s not much use for physics.

    > Maybe the realm of limited-register computing is a new domain.

    It doesn’t seem so new to me. Talk to anybody who works on numerical computation. Trying to get numerical simulations to work correctly despite the fact that floating point numbers have limited precision is a major problem they face. It’s a whole field of research.

    > Natural Intelligence (AKA “politics as usual”) is NOT SAFE.

    Yes, even if we avoid powerful AGI, long term I don’t think humans are sufficiently “Friendly” to avoid extinction. We’ll happily bring about the end of humanity and claim it to be god’s will, or some other nonsense reason.

    > Give me a call. I’m REALLY committed to participating. I’ve been putting in 20 – 45 hours per week, but I may not be well-focussed on that which will make a difference. I need a research advisor.

    🙂 I think I understand your general interests, but do you have a specific idea of what you want to achieve?

  45. Shane Legg says:

    Benjamin:

    For some reason your most recent comment disappeared into my WordPress spam filter. I just now noticed it there and fished it out…

    Discussions about alternative foundations of mathematics always seems a bit pie-in-the-sky to me. Classical mathematics has worked well for a long time. In order to seriously consider something else, I’d need to see a simple example of where using an alternative mathematics was better than classical mathematics for some concrete problem.

  46. Joe Harris says:

    Shane,

    Thanks for the reply re SQL.

    Having *huge* calculation speed could allow for brute force query evaluation techniques that are not practical on standard CPUs.

    I just wanted to point you at a presentation on the GPGPU.org site about this very topic. http://www.gpgpu.org/s2005/slides/govindaraju.DatabaseOperations.ppt

    Congrats on the book, looking forward to reading it.

    Joe

  47. Shane Legg says:

    Joe:

    Oh, wow. It hadn’t occurred to me that GPUs could be useful in databases…

Comments are closed.