AIQ

Some of you might remember the talk I gave at the 2010 Singularity Summit about Algorithmic IQ, or AIQ for short.  It was an attempt to convert the theoretical Universal Intelligence Measure into a working practical test of machine intelligence.  The results were preliminary, but it seemed to work…

It’s now over a year later, so I guess some of you are wondering what happened to AIQ! I’ve been very busy working on other cool stuff; however, Joel Veness and I have been tinkering with AIQ in our spare time.  We’re pleased to report that it has continued to perform well, surprisingly well in fact.  There was some trickiness in getting it to work efficiently, but that aside, it worked perfectly straight out of the box.

We recently wrote a paper on AIQ that was accepted to the Solomonoff Memorial Conference.  You can get the paper here, the talk slides here, and we have also released all the Python AIQ source code here.  It’s designed to be easy to plug in your own agents, or other reference machines, if you fancy having a go at that too.

If you’re not sure you want to read any of that, here’s the summary:

We implemented the simple BF reference machine and extended it in the obvious ways to compute RL environments.  We then sampled random BF programs to generate environments, and tested the agents against each of these.  This can be a bit slow, so we used variance reduction techniques to speed things up.  We then implemented a number of agents.  Firstly, MC-AIXI, a model-based RL agent that can learn to play simple games such as TicTacToe, Kuhn poker and PacMan, but is rather slow to learn.  Then HLQ(lambda), a tabular RL agent similar to Q-learning but with an automatic learning rate.  Then Q(lambda), a standard RL agent, and Q(0), a weaker special case. Finally, Freq, a simple agent that mostly just repeats whichever action has been the most rewarding so far, occasionally trying a random action (a rough sketch of this style of agent appears after the results discussion below).  There was also a random agent, but that always got an AIQ of zero, as expected.  The results appear below, across various episode lengths:

The error bars are 95% confidence intervals for the estimates of the mean.  As you can see, AIQ orders the agents exactly as we would expect, including picking up the fact that MC-AIXI, while quite powerful compared to the other agents, is also rather slow to learn and thus needs longer episode lengths.  We ran additional tests where we scaled the size of the context used by MC-AIXI, and the amount of search effort used, and in both cases the AIQ score scaled sensibly.  See the talk slides for more details, or the paper itself.
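
For those curious what the simplest of these agents looks like, here is a rough Python sketch in the spirit of Freq. It is not the released code and the names are purely illustrative: the agent keeps a running average of the immediate reward earned by each action, usually repeats the best one, and occasionally explores at random.

    import random

    class FreqSketch:
        def __init__(self, num_actions, epsilon=0.1):
            self.num_actions = num_actions
            self.epsilon = epsilon                 # chance of trying a random action
            self.totals = [0.0] * num_actions      # summed reward credited to each action
            self.counts = [0] * num_actions        # times each action has been taken
            self.last_action = None

        def act(self, reward):
            # credit the reward just received to the action that produced it
            if self.last_action is not None:
                self.totals[self.last_action] += reward
                self.counts[self.last_action] += 1
            # mostly exploit the best average so far, occasionally explore
            if random.random() < self.epsilon:
                action = random.randrange(self.num_actions)
            else:
                averages = [t / c if c else 0.0
                            for t, c in zip(self.totals, self.counts)]
                action = max(range(self.num_actions), key=lambda a: averages[a])
            self.last_action = action
            return action

The real Freq differs in its details, but this is roughly the level of sophistication involved.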


31 Responses to AIQ

  1. Kevembuangga says:

    Thank you Shane, very useful.

  2. Razorcam says:

    Thank you. The GPL open source code is a nice gift too. Let the AIQ contest begin!

    But what about “real world” agent constraints like cost, time, CPU and memory?
    If I want to compare the effective intelligence of two algorithms which can take maximum amounts of CPU and memory as parameters, it seems fair to give them the same amounts of CPU and memory.
    Or if I want to compare two intelligent entities which may or may not be computer based, it seems fair to give them the same amount of time and the same amount of other resources (measured in money for example).
    So it seems that a parameterized AIQ, e.g. AIQ(one hour, $100), would be useful.

    • Shane Legg says:

      I actually wrote a few pages on these kinds of issues, but unfortunately there wasn’t space in the conference paper to fit it all in… but in short, I agree. If we want to compare whole agents (= algorithm + computer) then this isn’t an issue. But, if we just want to compare algorithms then we need to remove the computer resource element from the equation. Perhaps by fixing the hardware and time, as you suggest.

      Thus I think there are really two things of interest here:
      1) who can build a system with the highest overall AIQ?
      2) what algorithm can produce the most AIQ on a reference hardware platform?

      As AGI designers our job is 2). But what really matters to society, in terms of the singularity etc., is 1), where advanced algorithms and supercomputers combine.

  3. gwern says:

    Very neat.

    Two other thoughts:

    – What connection does this have to the other group of researchers measuring intelligence AIXI-style? http://lesswrong.com/lw/42t/aixistyle_iq_tests/

    – AIXI-MC can learn Pacman; can it learn a more prestigious game like chess? Could one just run a pair of AIXI-MCs against each other and teach them chess that way? (I suggested this offhand at http://lesswrong.com/lw/8fb/why_an_intelligence_explosion_might_be_a/58×2 and someone sounded interested in actually trying it out.)

    • Shane Legg says:

      Hernandez and Dowe also use the Universal Intelligence Measure, among other influences, but then head off in a different direction. In my opinion they overcomplicate some things, and I disagree with other things they do and claims they make. So I’d sum them up as a related but somewhat different approach. I compare the approaches a bit in section 5 of the paper.

      Joel’s certainly thought about teaching MC-AIXI chess; indeed, Joel made one of the best chess agents that learn through self-play, many years ago. Anyway, it seems unlikely that you could get MC-AIXI 1.0 to do this, due to resource problems among other reasons. Much better versions are, however, on the way. The CPU time and memory consumption have both recently been improved by a factor of 10. Plus he’s got a new, more powerful version of CTW that does switching rather than simple Bayesian mixtures: http://arxiv.org/PS_cache/arxiv/pdf/1111/1111.3182v1.pdf And there is more good stuff in the pipeline.

      • Joel Veness says:

        As Shane mentioned, playing chess is way beyond the capabilities of mc-aixi 1.0. A more modest goal for an mc-aixi style approach would be to support a model class that could at least learn the rules of chess quickly. Maybe once this is achieved then one could consider getting a whole mc-aixi style system to play chess. It’s certainly something I will try for in the future if possible, but for now there are a lot more fundamental problems that need addressing.

        Also, regarding what happens under self-play with two AIXI approximations… I don’t know! The place to start would be with something very simple like tic-tac-toe or Kuhn poker rather than chess, though. Hopefully somebody tries it at some stage. Some earlier work related to this can be found in “Universal learning of repeated matrix games” by Poland and Hutter.

      • Eray Ozkural says:

        Isn’t MC-AIXI basically randomized Levin search? If so, it is unlikely that you could use it to solve any problem whose solution complexity exceeds the logarithm of the memory of your machine, regardless of any constant factor improvements, even on unrealistically fast machines. Constant factor improvements aren’t that important for any algorithm. Of course, I’m not even talking about universal program codes or incremental learning, which are also essential. Still, it might be possible to solve conceptually simple problems like chess with non-universal learners. That would be interesting. I see that it’s been suggested to fine-tune the learner to play chess, but that sort of defeats the purpose, although “parameter mining” is very popular in the machine learning field (almost everyone does it, which is why the results are unreliable). People have referred me to ILP research, which always did that trick: choosing primitives carefully. Well, if you choose them very carefully, you can solve just about any problem.

  4. Mqrius says:

    So, who will be the first human to run ~1000 environments of AIQ measurements, and set a Turing-like baseline for intelligence?

    • Mqrius says:

      I was going to play around with it a bit myself, but there’s not a lot of documentation. Can you give a few pointers as to how I could set this up for user-interactive runs?

      • Shane Legg says:

        There is an agent called Manual.py in there. It’s pretty primitive, but I did use it to see how well I could do. I seem to recall that I was around 70… but I didn’t manage enough samples to get a good statistic (I spent a few hours doing it before I started going crazy). Anyway, if you’re used to using Python I’m sure you can hack up something better, perhaps along the lines of the sketch below…
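
        To be clear, the following is only an illustrative sketch, not the Agent interface used in the released code (the class and method names here are made up), but it shows roughly all a hand-controlled agent has to do: print the latest percept and reward, ask a human for an action, and return it.

        class HandAgent:
            # Purely hypothetical interface; the released AIQ agents use their
            # own base class, so this would need adapting before plugging it in.
            def __init__(self, num_actions):
                self.num_actions = num_actions

            def act(self, observation, reward):
                # show the human what just happened, then ask for the next action
                print("observation:", observation, " reward:", reward)
                while True:
                    choice = input("action [0-%d]: " % (self.num_actions - 1))
                    if choice.isdigit() and int(choice) < self.num_actions:
                        return int(choice)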

        The next question is whether this is a sensible thing to be doing. The thing is that the human brain is well adapted, by both evolution and many years of learning, to deal with the 3D world we find ourselves in. Here the space is much more abstract, more like little logical and mathematical puzzles. If we want an intelligence measure that is more in line with human experience, we’d need to plug in a much more complex reference machine than BF. In another project I’m working on, I am, in some sense, heading in this direction.

        • Razorcam says:

          I agree with Shane because the goals of IQ and AIQ are different.
          AIQ measures an agent’s ability to get rewards in unknown randomly generated environments. That is, the agent has absolutely no previous knowledge about these environments.
          That doesn’t mean AIQ ignores your ability to acquire knowledge about an environment and then use that knowledge to get rewards in it: doing exactly that is the obvious way to get a good AIQ, and it is exactly what the perfect (and only approximately computable) AIXI agent does in order to get a virtually perfect AIQ score.

          In IQ tests (and in the Turing test) your preexisting knowledge of the human 3D universe and of the human culture (including human language) is so important that without it your score will be near zero. So IQ and Turing tests are not pure measures of problem solving skills, they are also measures of what you have already memorized about human culture.

          • Mqrius says:

            *Facepalm
            Sorry, I didn’t see the obvious Manual.py there. Thanks for the heads up 🙂

            “The next question is whether this is a sensible thing to be doing.” Oh I agree that it probably makes no sense. However, logic puzzles are fun, I’m familiar with brainfuck (although that shouldn’t be required, of course), and I was just curious how this would work. Playing it manually gives a better impression of what the AIs are doing, and what is being tested.

            “In IQ tests your preexisting knowledge of the human 3D universe and of the human culture (including human language) is so important that without it your score will be near zero.”
            This is only true for some IQ tests/questions, I guess. Questions asking for the next number in a sequence are (usually) independent of culture. I do agree with the gist of your comment though.

            • Razorcam says:

              “Questions asking for the next number in a sequence are (usually) independent of culture.”
              I agree, and I remember (though I read it a long time ago, so beware) that Hernandez cited an algorithm which already outperforms humans on such IQ questions. Hernandez’s conclusion was that this algorithm was proof that such questions do not measure intelligence very well, because current machines are “obviously” not as smart as us. It seems that Hernandez is going more in the “human culture” direction, like the Turing test, rather than in the “pure problem solving” direction, like AIQ.

              Since you like “pure problem solving” puzzles you should try mine: http://www.razorcam.com/
              (it really looks like it could be an AIQ graphical human interface, though I built it a few years ago)

  5. Toby says:

    I’m not an AI expert but I have what I hope is an intelligent question: How do you “scale up” the goals? i.e. how do you move from puzzles upward to tasks closer to a human’s capability? Is this automatable?

    • Shane Legg says:

      I’m not sure I understand the question here.

      Many of the problems in BF based AIQ are already beyond human capability. Perhaps what you mean is to have problems that are more similar to the kinds of problems we humans typically face, e.g. making a cup of tea. To do this you’ll need to replace BF with a far more complex reference machine. I’ve made AIQ so that it’s easy to add new reference machines; unfortunately, any reference machine that’s close to the real world is going to be a huge job to program.

      • Toby says:

        I think you got it. Now I’m confused as to how to get from AIQ measuring “puzzles” to measuring “humanness.” I’ll think about it.

      • Julian Morrison says:

        Write a reference machine that hooks them into a Go/Baduk online server and has them learn and play against humans? Most of the work will already have been done by the Go server side. (Go rankings, when they stabilize, are fairly objective in the sense of predicting wins.)

        • Joshua Olson says:

          Someone finally mentioned Go!

          Personally, I’m much more interested in a universal ai learning Go than Chess…

  6. Kevembuangga says:

    Somewhat off-topic, but I stumbled upon an “interesting” future problem for large-scale AI: Embracing non-determinism.

    • Shane Legg says:

      Yeah, I know about him and this area of research…

      One of the problems they have is that, while stochastic hardware could be very fast and efficient in theory, in practice it’s not competitive due to the massive economies of scale in the current design architecture.

      Also, while I like the idea of doing AGI via induction on general program spaces in a theory kind of way… indeed it’s kind of Solomonoff like… I don’t see how it will scale far enough. Indeed, on all the benchmarks I know of, this approach isn’t very competitive with other methods.

  7. David says:

    This is great stuff, thanks!

    Incidentally, what are your thoughts on the alternative approach presented by Hibbard [1]? He raises this specific objection regarding the measure:

    “It gives less credit for environments defined by longer programs even though they are usually more difficult for agents. Given arbitrarily small ε > 0, total credit for all but a finite number of environments is less than ε. That is, total credit for all environments greater than some level C of complexity is less than ε, whereas credit for a single simple environment will be much greater than ε. This is not the way we judge human intelligence.”

    [1] Measuring Agent Intelligence via Hierarchies of Environments http://www.ssec.wisc.edu/~billh/g/hibbard_agi11a.pdf

    • Shane Legg says:

      I argue against this in the Universal Intelligence journal paper, indeed, I think this error in thinking has caused quite a few problems in artificial intelligence. I got the same question a few days ago, so I’ll repost my reply below:

      The problem, I think, is that our intuitions about intelligence are based on human intelligence where certain things can be taken for granted… things that aren’t necessarily true with machines. Ask yourself this: could you ever find a human chess grand master who was totally unable to learn to play TicTacToe? No, that would never happen! Thus, when we meet a chess grand master, we take his/her ability to play chess as strong evidence that they can also do much simpler things like play TicTacToe. Thus, being able to do some really complex tasks implies the ability to do really simple stuff, like spotting the pattern in an infinite sequence of zeros.

      With a machine this is not true at all. Deep Blue can’t learn to play TicTacToe, let alone poker or anything else. So, for a machine, the ability to do something complex tells us very little about its ability to perform well in simple environments.

      I think that this incorrect assumption, based on our human intuitions, has caused a lot of problems in terms of AI ever getting to produce real intelligence. It has made us try to make machines that are better and better at one very specific complex thing, like chess, rather than making machines with very general abilities and working up in a broad way… as suggested by our informal definition of intelligence. This is an example:

      Agent A is designed specifically for a very trivial environment, like the vacuum cleaner task introduced in Russell’s AI book, whereas agent B is designed specifically for a very complex task, e.g. navigating to the moon. Intuitively, I would say B is more intelligent than A, even though each can perform well in only one environment; however, according to equation (1), B’s intelligence is heavily penalised by the high complexity of the environment for which it is designed.

      I would say that B has less *general* applicability than A, and so is less intelligent. This is because, under a Solomonoff prior, tasks that A is made for are much more likely than going to the moon. Of course, if B can be made to also do the very trivial problems that A can do, while still able to go to the moon, then it would be more intelligent than A, as we would expect.

      So the whole emphasis is on being very very general. With a Solomonoff prior, that means that being able to do the simple stuff really matters if you’re going to have a general intelligence.
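
      For reference, the measure under discussion (equation (1)) is the Universal Intelligence Measure, which weights the value an agent π achieves in each environment μ by that environment’s Kolmogorov complexity K(μ):

          \Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}

      An environment with a long shortest description therefore gets exponentially little weight, which is exactly the property Hibbard objects to and which the argument above defends.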

  8. Matt Mahoney says:

    The paper’s conclusion is interesting. Choosing a good universal Turing machine to generate samples, one that is both simple and expressive, is a hard problem. I found this out in my attempt to design a universal data compression benchmark at http://mattmahoney.net/dc/uiq/ (which the paper references). At first, the empirical evidence was promising: compression programs were ranked on my benchmark in about the same order as they were on real data. But later it became clear that my benchmark could be gamed by specialized programs. Sure, I could tweak the language to level the playing field for this case, but how many times would I have to keep doing this? What is the general principle by which I can say that a program tuned for my benchmark really is a good predictor for cases of practical interest? Is it possible to design a language that can’t be gamed?

    The problem is that very simple machines like Wolfram’s 3 color 2 state UTM or CA rule 110 don’t produce useful output for programs of practical length. It seems that languages need an anthropomorphic bias (concepts like arithmetic and strings) to do that. Even BF has this bias. (What is fundamental about the increment operator?) I could also test on a wide range of simple languages, but then it’s no longer simple. It keeps leading us back to the Turing test. It takes human judgement to decide if something is intelligent.

    • Razorcam says:

      When I look at the above graph (AIQ vs. episode length) I have the following idea: if you design an AGI that expects randomly generated environments in a specific programming language L1, and you measure its AIQ by generating environments first in L1 and then in another language L2, I expect that the two AIQ scores will get closer the longer the episode length is.

      That’s because I expect the AGI to try to model the environment with small programs written in L1, and as these models fail, the AGI will increase the complexity of its L1 models so that their predictions get closer to the behaviour of the unknown environment written in L2. And they will get closer and closer the longer the episode length is.

      • Shane Legg says:

        Yes, for powerful learning agents this should be the case. There will still be a cost to be paid due to non-ergodic environments (i.e. ones where the agent must make the right decision and only gets one chance in an episode to do so).

        A better idea might be to allow the agent to pre-train on samples from the test. It’s similar in spirit, but now avoids the non-ergodic environment problem.

    • Shane Legg says:

      I think you make an interesting point here. Given a fairly “natural” reference machine the simple environments will be things like the constant environment, simple stationary deterministic bandits, maybe some non-stationary and/or stochastic ones, some simple MDPs and so on. Essentially, we have chosen the reference machine to give us this kind of a distribution. I’m sure our extended BF isn’t the only reference machine to have this property. At the other end of the scale when the programs are long, then the choice of reference machine isn’t so important: if an agent can deal with million bit environments then it can also handle simple reference machine changes of a few thousand bits. The difficult part lies between these extremes, which seems to be what you found. In this zone the reference machine invariance hasn’t kicked in yet, and the bias we have imbued the test with might not be the most “natural”.

      I think there are two perspectives on this. One perspective is that we shouldn’t be so human centric about AGI. Perhaps intelligence with respect to BF is still intelligence, it’s just not intelligence as we humans usually know it. From this perspective, “gaming” the test doesn’t really exist — you’re simply improving the agent by giving it more knowledge of the reference machine. As suggested in the paper, for powerful learning algorithms we can mitigate this by allowing the agent to pre-train on samples from the test in order to adapt to the reference machine.

      Another view is that we want to pick better reference machines, where by “better” we mean ones that give us a distribution that is closer to human experience. The problem with this, as I point out in the paper, is that it takes a lot more work. Nevertheless, even a partial effort in this direction could be worth doing. For example, having a reference machine where things take place in a Euclidean space would be a start. Some of the things I’m now working on are going in this direction.

      In the particular case of compression: the point of a Solomonoff prior is that it’s a good prior to use if you really don’t know *anything*. If you do know something, it’s not going to be the best choice. What you should do is take the Solomonoff prior and then condition it on a large amount of data from the domain. The posterior that comes out of that should then be a good starting point when facing a new file to compress. So, rather than a naked uiq, what you should do is first condition on real data to get a more “real” distribution. However, given that real data to compress isn’t a scarce resource, there isn’t much point in doing this. You may as well just test on real data. So long as your test sample is large enough to avoid overfitting, and is statistically representative, you’re in good shape.
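
      To spell out that conditioning step (this is just the standard universal semimeasure, nothing AIQ-specific): if M is the Solomonoff prior and D is a large corpus of domain data, the predictor you actually want is the conditional

          M(x \mid D) \;=\; \frac{M(Dx)}{M(D)}, \qquad \text{where} \quad M(x) \;=\; \sum_{p \,:\, U(p) \,=\, x*} 2^{-\ell(p)}

      Once D is large and representative, this posterior is dominated by programs that model the domain well, which is why simply testing on real data gets you to much the same place.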

      Unfortunately this doesn’t make so much sense in the context of UIM/AIQ as the agent must interact with the environment and so obtaining real experience is expensive. Thus, if you want a human-like intelligence test, building a more human-world-like reference machine might be the way to go.

  9. John Middlemas says:

    You mention that the upper limit (of intelligence?) is AIXI. Any idea what the AIQ score for that would be?

  10. Pingback: Universal intelligence and biased generality | David Ruescas

  11. Pingback: “Human level AI” is a misleading concept | David Ruescas

  12. Pingback: Artificial Intelligence: If the world still fails to understand what true intelligence is, how does one expect to create artificial intelligence? - Quora
