While I was visiting Marcus Hutter at ANU a month or so ago, I got talking to one of his students, Joel Veness, who’s working on computable approximations to AIXI. Joel has a background in writing Go-playing algorithms, which perhaps makes him perfect for the job. I saw recently that the Monte Carlo AIXI paper describing this work is now available online if you want to check it out.
The basic idea goes as follows. Full AIXI has two components: an extended Solomonoff predictor to model the environment, and an expectimax tree to compute the optimal action. To scale AIXI down while keeping roughly the same form, you need a tractable replacement for each. Here’s what they did: in place of extended Solomonoff induction, a version of context tree weighting (CTW) is used, extended to active environments much as Hutter had to extend Solomonoff induction for AIXI. In place of the expectimax search, a Monte Carlo tree search is used, similar to those in Go-playing programs: selection within the existing tree, tree expansion, a so-called play-out policy, and finally a backup stage to propagate the new value estimates up the tree.

You have to be a bit careful here: as the agent imagines different future observations and actions, it has to update its hypothetical beliefs to reflect them, so that its analysis and decision making stay consistent. Then, once a possible future has been evaluated, its effect on the agent’s model of the world has to be unwound, so that the agent doesn’t, in effect, start confusing its fantasies with its present reality.
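To make the mechanics concrete, here’s a minimal Python sketch of one simulation step. It’s not the authors’ code: the `CTWModel`, `Node`, `select_action`, and `simulate` names are my own hypothetical interface, the percept sampling is a random stub rather than a real CTW mixture, and for brevity the recursive tree walk stands in for a separate play-out policy. The essential point is the update/revert pairing around each imagined step.

```python
import math
import random

class CTWModel:
    """Stand-in for the agent's action-conditional CTW mixture model.
    The only feature the search relies on is that updates can be undone."""
    def __init__(self):
        self.history = []  # interleaved actions and (observation, reward) percepts

    def update(self, symbol):
        self.history.append(symbol)

    def revert(self, n):
        # Unwind the last n updates so imagined futures don't pollute reality.
        del self.history[len(self.history) - n:]

    def sample_percept(self, action):
        # A real CTW mixture samples from its predictive distribution given
        # the whole history; this stub just returns random percepts.
        return random.randrange(2), random.random()

class Node:
    def __init__(self):
        self.children = {}  # action -> Node
        self.visits = 0
        self.value = 0.0

def select_action(node, actions, c=1.4):
    # Selection: UCB over the node's children; unvisited actions come first.
    def score(a):
        child = node.children.get(a)
        if child is None or child.visits == 0:
            return float("inf")
        return child.value + c * math.sqrt(math.log(node.visits + 1) / child.visits)
    return max(actions, key=score)

def simulate(node, model, actions, depth):
    # One simulation: walk down the tree (expanding as needed), sampling
    # percepts from the model, then back the return up the visited nodes.
    if depth == 0:
        return 0.0
    action = select_action(node, actions)
    obs, reward = model.sample_percept(action)
    model.update(action)          # the agent's hypothetical beliefs now
    model.update((obs, reward))   # include this imagined action/percept pair
    child = node.children.setdefault(action, Node())
    ret = reward + simulate(child, model, actions, depth - 1)
    model.revert(2)               # unwind this step of the imagined future
    child.visits += 1
    child.value += (ret - child.value) / child.visits
    node.visits += 1
    return ret
```

Because the revert happens as each recursive call returns, every level of the imagined future is undone in reverse order, leaving the model exactly where it started before the simulation.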
The algorithm is both embarrassingly parallel and anytime, which is very nice. In less technical language: it would be fairly easy to get it to run efficiently on a massively parallel supercomputer, and it can be forced to decide what action to take at any moment, always returning the best action it has been able to compute so far. Thus, if you want a smarter agent, just give it more time and/or CPUs. Already they have shown that MC-AIXI can learn to solve a bunch of basic POMDP problems, including playing a somewhat reasonable game of Pac-Man. It would be interesting to see what it was capable of on a supercomputer with ten thousand times the resources of their desktop PC.
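Continuing the toy sketch above, the anytime property amounts to something like this: keep running independent simulations until a deadline, then act on the best estimate found so far (the `anytime_decide` name and the deadline handling are my own invention).

```python
import time

def anytime_decide(root, model, actions, deadline):
    # Refine the search until time runs out; each simulation is independent,
    # which is also why the algorithm parallelises so easily.
    while time.monotonic() < deadline:
        simulate(root, model, actions, depth=6)
    visited = {a: n for a, n in root.children.items() if n.visits > 0}
    return (max(visited, key=lambda a: visited[a].value)
            if visited else random.choice(actions))

# e.g. pick an action with a 50 ms budget:
# action = anytime_decide(root, model, [0, 1], time.monotonic() + 0.05)
```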
A key direction for future research is building better sequence predictors, in particular ones able to identify more complex kinds of patterns in the agent’s history. I guess all sorts of machine learning techniques could come into play here… and possibly combine to produce quite a powerful RL agent?