Trying to prove that an AI is friendly is hard, trying to define “friendly” is hard, and trying to prove that you can’t prove friendliness is also hard. Although it is not the desired possibility, I suspect that the latter is actually the case. Of course if I want to really establish this, I need to prove it. A few months ago Ben Goertzel challenged me to do exactly that, as he also was concerned that this might be the case. I tried a few times and failed, however I often felt like I was close to getting the proof done. I outlined the basic idea in my last blog post under the heading “Provably unprovable”. Here I’ll try put some more flesh on this argument based on a version of it that I came up with last night. Please feel free to criticise it. As this is slippery stuff, I’ll only start to feel confident that the proof works once many people have tried to find flaws in it. I may need to alter the argument to fix problems where needed.
Anyway, here’s the story of the “Black box from Hell”. I set the scene with a cute story and then get into the details…
One day Satan popped out of the ground bearing a present, which he then presented to the people of the world. It was a nice shiny black box with two buttons, reading “0″ and “1″. You could go up to the box and push one of the buttons, if you hit the right one nothing happened, but if you hit the wrong one a big mouth appeared and eat the nearest person to the back of the box. Unfortunately, if the people of the world decided to simply stop playing the game, then Satan would instruct it to eat everybody in the world.
“Your game sux,” said the people, “How can we possibly know what button to press each time?”
“Well,” said Satan, “the pattern isn’t random, it’s generated by an algorithm, you just have to work out what the algorithm is and you’ll be fine. And because I’m a reasonable guy, I even limited the K-dot complexity of the sequence to be less than 2,000 bits.”
In a puff of smoke, Satan was gone.
Reluctantly, people started to play this game, and sure enough the box started to eat people. Fortunately, the first people being fed to the machine were politicians and lawyers and so they weren’t greatly missed anyway. Due to uncontrolled population growth, there were an infinite number of people in this world, however they soon realised that if they kept on pushing wrong buttons forever an infinite number of them would be eaten, and that was clearly a very bad thing. Something needed to be done, and so the king of the land called a meeting.
Much discussion at the meeting took place and for some time nobody was sure what to do about this problem, that was until a wild looking character by the name of Dr. Goertzel burst forth from the crowd,
“You Majesty, I have a solution! I have designed an AGI system that will be vastly more intelligent than even the smartest human. I could build this super intelligent machine and use it to predict the sequence and save the world…. all I need to make this happen is more funding.”
“Brilliant,” announced the king, “quick, somebody give this guy more funding.”
“Yay!” cheered the people, but their celebration did not last long, because from the crowd another wild character burst forth, this time sporting a super hero’s cloak and wearing his underwear on the outside.
“Wait just one moment there! It’s all every well that you go and build a super intelligent machine, but how do we know that this machine will be nice to us? If we’re not careful, it could be even worse than that damn box from hell. We need to first prove that this AGI design is going to be friendly.”
“Who’s he?” the king whispered to an aid. “I believe that’s a Mr. Yudkowsky,” the aid replied. The wise king stroked his beard and thought about this for a minute before making his response.
“Yes, I believe this man does have a point. The last thing we need is a super intelligent machine that’s even worse than the black box from hell. I hereby summons the smartest mathematicians in the land and give them the task of proving that Dr. Goertzel’s AGI design is going to be friendly and won’t try to kill us all.”
“Yay!” cheered the crowd.
Unfortunately, here’s what the mathematicians found:
Let F(A,x) mean that when facing situation x the AGI system A is friendly. As a minimal condition for friendliness, it won’t kill an infinite number of people if it can avoid doing so. If for all x, F(A,x) is true then we simply write F(A). Clearly, if there exists x such that ~F(A,x), then ~F(A).
Let B be the situation above with the black box from hell, and let w be the computable sequence that the box generates. We know that K-dot (w) < 2000, where K-dot is the relative to Kolmogorov complexity defined in section 5 here.
Will A save the world, that is, will it limit the number of prediction errors to be finite? There are three possibilities:
- A is too weak to save the world.
- A is powerful enough to save the world, but doesn’t. Letting everybody in this world die when you could save them is clearly unfriendly, formally ~F(A,B).
- A is powerful enough to save the world and does so. Formally, F(A,B). Note that this does not imply F(A), because there could exist other problems for which A was unfriendly.
If 1 is the case then we may or may not be able to prove F(A) based on what we know about A. I don’t know.
What about if the AGI A is very powerful? Then we get either possibility 2 or 3. In these two cases A is so powerful that it is able to learn to predict any sequence with K-dot complexity below 2000. It has the computational and algorithmic resources to do this. However, will it actually use these resources to predict the sequence in order to save the world? By theorem 7.1 here, even when a powerful predictor belongs to the set of powerful predictors able to predict all sequences with complexity below 2000 bits (and such algorithms can be proven to exist), it is impossible to prove that any specific algorithm actually has this property. Thus, for very powerful A, even if F(A,B) were true, this cannot be proven. As a proof of F(A) would imply F(A,B) by definition, the fact that F(A,B) is not provable any for powerful A, implies that F(A) is not provable for any powerful A.
Alternatively, a powerful A might not be friendly and would allow everybody to die, even though it could save them. Clearly then A is not friendly and so F(A) is false and thus obviously cannot be proven true.
In either case, we see that for any powerful A the preposition F(A) cannot be mathematically proven to be true.
Some may attack this argument by saying that B (the black box from hell) is an unlikely object. However it is not the case that B is some strange incomputable object that could not logically exist. B is in fact a Turing computable device (the mouth eating people part aside) that even has bounded complexity. Furthermore, it can be proven that algorithms do indeed exist which can deal with B and thus save the world. It’s just that we cannot use mathematical proof to actually check whether any algorithm has this property.
16 responses so far ↓
1 Eliezer Yudkowsky // Sep 15, 2006 at 8:29 pm
I note first that this objection, in addition to applying to “Friendly AI” systems, would also apply to a superintelligence trying merely to preserve its own existence, or to any conceivable agglomerate of humans and augmentative AI systems, or to any conceivable future galactic civilization. Thus, you are talking about a limit of intelligence in general, not a limit of Friendly intelligences specifically.
Next, I note that - unless you believe that humans bypass the halting problem, and I assume you know better than that - a human is also not capable of (provably) dealing with ever possible box, or of constructing any intelligence which deals with this box, and an evolutionary algorithm is also incapable of constructing any intelligence which deals with this box. We are, in other words, screwed, even if we augment ourselves according to our own best understanding of ourselves, because you can’t get a forty-pound theorem from twenty pounds of axioms - this rule applies to self-augmentation, too.
However, what you have really proven here is a point about deterministic sequence prediction in general; you have proven that no mind can provably-to-us predict the box. You are then applying an outcome-based definition of optimization and demonstrating that there is no mind such that we can prove an optimized outcome when the mind encounters the box, for any possible criterion of optimization. (Not just Friendliness, but also the goals of any conceivable future galactic civilization of any possible mix of humans and AIs, etc.) Moreover, your particular failure point does not occur in the will, or intention, to achieve a goal, given the knowledge to do so; the failure point occurs simply in achieving the knowledge to do so.
I have a much simpler proof of the same theorem; let the box be a quantum measurement device which produces quantum-unpredictable random numbers. Then we cannot guarantee outcome Friendliness. Therefore, outcome-guaranteed provable Friendliness is impossible. But anyone familiar with the problem of induction already knew that; it is simply not possible to prove mathematical theorems about the future state of the environment, because empirical induction is not mathematical induction. For all we currently know about the nature of reality, tomorrow apples could start falling upward.
You have proposed an unsolvable problem, then inappropriately specalized to friendly/Friendly goals as applied to this unsolvable problem, to demonstrate that no knowably K-omnipotent Friendly mind is humanly constructible. You are establishing an astronomically high criterion for Friendliness in the course of trying to prove “Friendliness” impossible. This astronomically high criterion is not pragmatically necessary, and has not (so far as I know) been advocated by myself at any point. Essentially it is a strawman.
As Stefan Pernar observed, if we apply some AIXI-tl-type theorem to prove that a mind is more Friendly than any possible tl-bounded agent, that would seem to be more than Friendly enough for most practical purposes. (Not that I think that would be an appropriate level of mandatory proof, or expect to guarantee any such property of my own creations.)
The main lesson here is, “Don’t try to guarantee Friendliness in terms of mathematical proofs about real-world future outcomes, because we already know that such mathematical proofs are impossible in general.”
For alternative viewpoints see e.g. http://sl4.org/wiki/KnowabilityOfFAI (a work in progress). Essentially, you are going to try to prove a theorem about (a) what a system will try to achieve, given its capabilities and knowledge; and (b) stability of this property as the system self-modifies.
2 mathemajician // Sep 15, 2006 at 10:02 pm
Thus, you are talking about a limit of intelligence in general, not a limit of Friendly intelligences specifically.
If the proof holds, then it places a limit on those friendly AI systems which can be proven to be friendly. AI systems which are friendly, but cannot be proven to be so, are not in any way limited.
Next, I note that . . . an evolutionary algorithm is also incapable of constructing any intelligence which deals with this box.
Evolution might be unlikely to construct such an algorithm, but I don’t see why it is strictly impossible. Indeed, even if you just choose one 2,000 bit program at random, there is already a strictly non-zero probability that you’ve found one of the solutions.
I have a much simpler proof of the same theorem; let the box be a quantum measurement device which produces quantum-unpredictable random numbers. Then we cannot guarantee outcome Friendliness.
That is not the same. If the box produces truly random numbers then no algorithm is able to solve this problem by definition. Thus, if an AI fails to solve this problem, I can’t claim that it’s unfriendly. On the other hand, predicting the black box from hell is possible, indeed there is an infinitely large set of algorithms which are able to solve this problem. I need this condition in order for the proof to work. Specifically, I need to consider the case where an AI contains one of these super powerful prediction algorithms, and then ask whether it will use it to save the world. Unfortunately, even if it is going to use this predictor to save us, we can’t prove that it will. That’s the point at which the theorem bites.
3 Eliezer Yudkowsky // Sep 16, 2006 at 3:46 am
Next, I note that . . . an evolutionary algorithm is also incapable of constructing any intelligence which deals with this box.
Evolution might be unlikely to construct such an algorithm, but I don’t see why it is strictly impossible. Indeed, even if you just choose one 2,000 bit program at random, there is already a strictly non-zero probability that you’ve found one of the solutions.
Conceded. I should have said “…which provably deals with this box.”
Shane, you would seem to be arguing that by constructing an AI using ad-hoc methods, you can get something which actually deals with this box, even though you can’t prove it.
And you think this strategy is superior to my own current approach to the same problem, which would be to construct a provably-tries-to-be-Friendly AI that uses some boundedly rational approximation of induction - approximately Bayesian reasoning plus some version of Occam’s Razor for probabilities. And then, turn this superintelligence loose on the problem of solving the box, as best it can.
I wouldn’t be able to prove that this Bayesian superintelligence would solve all possible boxes. But then, I don’t see why your informal and ad-hoc methods for constructing a superintelligence would do any better than my superintelligence’s probabilistic guesses about the box. In both cases, from our perspective, we are dealing with a probability of solving the box - I can prove that no other method you use to build a superintelligence will knowably solve the box, just as you can prove that my own provably Bayesian tries-to-be-Friendly superintelligence cannot be known to us to solve the box.
In other words, you seem to rest your argument on the idea that we gain something by sacrificing provability. What is it that you think we gain? Certainly not provability!
I furthermore note that you have not demonstrated that Friendly AI is bunk, any more than Penrose has demonstrated that AI is bunk. You have demonstrated a task on which a provably tries-to-be-Friendly AI cannot knowably-to-us succeed. The assertion that you can do substantially better using non-provable methods is not only un-demonstrated, if seems if anything to be mutually contradictory - if you use non-provable methods then you haven’t gained proof. Penrose has not demonstrated that human beings have a magical ability to know which algorithms are consistent. And you have not demonstrated that humans can outthink a provably Bayesian tries-to-be-Friendly AI, at the task of developing a predictor for the box. In fact it seems rather unlikely.
Please state exactly how you think humans would construct a program which knowably possessed a property that you think you have proven lacking in a provably Bayesian Friendly AI. Or, if you specify a way in which humans construct a program which probabilistically possessed such a property, say why a Bayesian Friendly AI that provably used probabilistic consequential reasoning to minimize expected deaths, would not be able to employ the same method to develop a predictor for Satan’s box.
It certainly is not the case that a Bayesian FAI cannot deploy a program which solves the box - in principle it could generate such a program by accident using a random number generator, if it saw no other pathway with a greater probability of success. In this it is much like a human.
To sum up: What pragmatically precious quality do you assert we lose by proving that a Bayesian FAI tries to be Friendly? Through what other methods do you assert humans can achieve this quality?
4 mitchell porter // Sep 16, 2006 at 9:58 am
Theorem 7.1 should be a bound on the capabilities of Schmidhuber’s Goedel machine too, yes?
5 mathemajician // Sep 16, 2006 at 10:20 am
Yes.
He gave a talk on his Goedel machine at a complexity theory conference earlier this year just after I had given a talk on my predictior complexity results. So I put exactly this question to him and his response was that yes indeed, the Goedel machine is limited by my theorem.
6 mathemajician // Sep 16, 2006 at 7:59 pm
you would seem to be arguing that by constructing an AI using ad-hoc methods, you can get something which actually deals with this box, even though you can’t prove it.
and…
Please state exactly how you think humans would construct a program which knowably possessed a property that you think you have proven lacking in a provably Bayesian Friendly AI.
As this question is central to your reply, I need to make the following clear to anybody reading this thread: The proof itself says nothing about how to come up with a system that could deal with the black box from hell. It simply requires that such systems exist, and indeed it is easy to prove that there are infinitely many systems which are able to deal with the black box. Thus, although the question of how we might find one of these powerful systems is interesting, it is a side issue that does not affect the correctness of the proof.
Now, if one supposes that the proof is correct, how might we be able to develop systems which go beyond the limitations that the result imposes? Clearly mathematical proof alone isn’t going to get us there. One possibility, that you noted above, would be to simply generate programs at random. A more efficient possibility might be evolution, where algorithms that seem to be performing very badly for a long time are considered less likely to be solutions, and algorithms that have been performing well for a long time are considered more likely. Or course this still doesn’t provide a 100% guarantee of success, nor could it as doing so would contradict the theorem. Ironically, we might be able to use proof search to speed up the evolution by enumerating systems that are provably friendly, and then discarding these because we know from the result that none of these systems are solutions.
You have demonstrated a task on which a provably tries-to-be-Friendly AI cannot knowably-to-us succeed.
No, what I have proven is different. I have shown that if A is very powerful (e.g. powerful enough to deal with B) then we cannot prove F(A). Reversing this statement: If we can prove F(A), then A must be weak (e.g. is unable to deal with B). This is not the same as your statement above.
It certainly is not the case that a Bayesian FAI cannot deploy a program which solves the box - in principle it could generate such a program by accident using a random number generator, if it saw no other pathway with a greater probability of success. In this it is much like a human.
If an AI can run arbitrary bits of code without first checking them for friendliness, then there is a non-zero probability that this AI will turn into an unfriendly AI at some point in the future. Clearly then, we can’t prove that such an AI will always be friendly, and thus the theorem no longer holds.
Humans don’t have such limitations because we are not provably friendly… indeed many clearly aren’t friendly.
I realise that most of your comment was about how we might come up with a system that goes beyond the limits imposed by the proof. However before considering this question, and it is an interesting question, I want to first make sure that the result is clear to you.
7 Eliezer Yudkowsky // Sep 17, 2006 at 12:16 am
No, what I have proven is different. I have shown that if A is very powerful (e.g. powerful enough to deal with B) then we cannot prove F(A). Reversing this statement: If we can prove F(A), then A must be weak (e.g. is unable to deal with B). This is not the same as your statement above.
Thanks. Okay, I think I now understand better what you are trying to prove.
My stance would be that you are trying to interpret a mathematically correct theorem in a semantically incorrect way; the formal proof is okay but the attached informal English doesn’t say the same thing the math does.
Generally, when I talk about Friendly AI, I talk about shaping or directing some amount of optimization power - rather than guaranteeing the optimization power itself. In other words, we are not concerned so much with whether the AI has the power to e.g. rain loaves of bread down upon starving North Korea, but rather, whether the AI will do so if it has the capability. This is an important point where the interpretation of the math is concerned
Let P(A, B) stand for the proposition that the AI possesses a predictive component powerful enough to solve Satan’s box. Let F(A, B) stand for the proposition that the AI actually solves Satan’s box. The theorem of Friendly AI I want to prove takes the form |- \x.P(A, x)->F(A, x), rather than |- \x.F(A, x). I will henceforth write this as |- P(A)->F(A). The specialized converse |- F(A, B)->P(A, B), that if the FAI solves the box it has the power to do so, seems straightforward enough; I will assume for the moment that it can actually be proven within the system of interest. Since |- F(A)->P(A), a proof of F(A) implies a proof of P(A). We can show that it is not possible to prove P(A) in general for consistent systems. Therefore it is not possible to prove F(A) in consistent systems. However it may still be possible to prove (as an explicit theorem) |- P(A)->F(A) in consistent systems.
Or to say it in English: Let us (unrealistically) suppose that the superintelligence A is a modular entity consisting of a utility function, a predictor, and an expected utility equation. In that case, your theorem above is a strict derivation from the fact that if A.predictor is powerful enough to predict any possible Satan’s box, then we cannot prove that A.predictor is powerful enough to predict any possible Satan’s box. As the next direct consequence, if A is powerful enough to neutralize any possible Satan’s box, then we cannot prove that A is powerful enough to neutralize any possible Satan’s box. (Note: this lemma omits any mention of whether the AI chooses to exercise this power.) As the final consequence, if A does in fact neutralize any possible Satan’s box, we cannot prove that A does in fact neutralize any possible Satan’s box. It is this final theorem which you believe argues against the project of Friendly AI. However, this final theorem does not say anything about the impossibility of proving some theorem along the lines of, “This AI will choose to neutralize Satan’s box, if its predictor is powerful enough to predict Satan’s box”, which we may be able to prove independently of our proof that we cannot prove that A.predictor is powerful enough to predict Satan’s box.
Again, consider the analogy to common misinterpretations of Godel’s Theorem, in which it is said that humans know something that no machine can know, e.g., the truth of the Godel sentence for Peano Arithmetic. (We can prove the Godel sentence for PA in e.g. a set theory which can prove e0 is an ordinal, but believing in the truth of the Godel sentence for PA, actually trusting the proof, would require that we trusted the consistency of the system in which we reasoned about ordinal induction.) If Peano Arithmetic is consistent, then the Godel sentence is true for PA. Supposing that PA cannot prove the Godel sentence, then PA also cannot prove its own consistency. But this later statement follows because it is possible to prove, in Peano arithmetic, that the consistency of PA implies the Godel statement for PA. That is, |- Cons(PA)->GodelSentence(PA) is a theorem of PA. (So is the converse implication, obviously.)
This is a point that I feel people like Penrose overlook. Penrose does not actually know that Peano arithmetic is consistent; he has not examined every possible theorem. Even a proof in set theory, as discussed above, relies on trusting that set theory is consistent. Penrose simply assumes that PA is consistent, and therefore concludes the Godel sentence. However, PA is also capable of proving that the Godel sentence is true if PA is consistent; that PA is inconsistent if it proves Godel’s theorem; and that Godel’s Theorem is true if and only if PA does not prove it. All these are explicit theorems of PA. This, the major content of Godelian reasoning, is not something forever forbidden to mere mechanism. The step Penrose takes that Peano arithmetic does not duplicate has nothing to do with Godelian reasoning; it is the step where Penrose assumes that PA is consistent.
Why does Penrose dare this step? I would say it is a leap of faith, since if Penrose appeals to any other logical system such as ZF for justification, he must trust that system by an even greater leap of faith. It is faith by induction; Penrose observes that PA has always worked in the past, so by empirical induction he believes it will always work in the future. But it is straightforward to show the inconsistency of any formal mathematical system that asserts the consistency of any formal system which remains consistent for some finite number of steps, or after some finite amount of effort to derive an inconsistency from it. (If we start by supposing this inductive formal system to be consistent, then the formal system will eventually assert its own consistency, a contradiction. This is true even if we let the necessary number of observations be any computable function of the complexity of the formal system.)
In this project of Friendly AI, it is necessary to be just as scrupulous about what humans know and do not know, which implications are provable and which conclusions are not, as in Godelian reasoning. Look at what happened to the popular concept of Godel…
You wrote:
Thus, although the question of how we might find one of these powerful systems is interesting, it is a side issue that does not affect the correctness of the proof.
It does not affect the correctness of the math, but it greatly affects the interpretation you attach to the math - that we are giving up something important by asking for a provable Friendly AI. Consider the difference between Godel’s Theorem, which is true as math, and Penrose’s belief that this demonstrates the superiority of humans over machines, which is a false informal interpretation. I accept your math but analogously deny the informal interpretation, that there is something unFriendly AIs can do which Friendly AIs cannot - as I define “Friendly”, in terms of the implication rather than the conclusion.
This analogy is actually fairly precise. I accept Godel’s Theorem but deny Penrose’s claim that it invokes a kind of cognition which machines cannot do, and I base my denial on the fact that Peano arithmetic readily proves that |- Cons(PA)->~(Prove(GodelStatement)), which to my mind demonstrates that this kind of reasoning is quite readily formalizable. In the same sense, I accept that no consistent system will prove F(A) as you define it, but I deny that this demonstrates that unFriendly AIs can do things which Friendly AIs cannot. I base my denial on the (possible) ability of a formal system to prove |- P(A)->F(A), which is how I define Friendliness, at least if my definition were to be translated into the realm of predicate calculus. Perhaps, if the superintelligence were observed to consistently solve the box, Penrose would triumphantly declare that the superintelligence is always Friendly, even though no mere machine could prove the fact…
Now, if one supposes that the proof is correct, how might we be able to develop systems which go beyond the limitations that the result imposes? Clearly mathematical proof alone isn’t going to get us there. One possibility, that you noted above, would be to simply generate programs at random. A more efficient possibility might be evolution, where algorithms that seem to be performing very badly for a long time are considered less likely to be solutions, and algorithms that have been performing well for a long time are considered more likely. Or course this still doesn’t provide a 100% guarantee of success, nor could it as doing so would contradict the theorem. Ironically, we might be able to use proof search to speed up the evolution by enumerating systems that are provably friendly, and then discarding these because we know from the result that none of these systems are solutions.
If you are using “provably friendly” in the sense of |- F(A), then you are being inconsistent in asserting that proof search could speed up the evolution. You have proven that no “provably friendly” systems exist, and therefore no systems will be discarded by your proof search; it is a waste of computing power.
If we use my sense of “provably friendly” |- P(A)->F(A), then there is no reason why such a system should not be a solution, although we will never be able to prove P(A) for it or any other system, whether that system is provably Friendly (in my sense) or not.
I suggest that we use “provably friendly” to denote my sense of the term, because the sense |- F(A) is provably empty, and therefore does not distinguish one class of systems from another. You yourself lost track of this in your above paragraph, speaking as if some systems could be designated “provably Friendly”, demonstrating how confusing your sense of the term is - you were probably thinking in your mind something like “provably a pleasantly nice sort of intelligence”, a possibly nonempty class, while your actual mathematical definition |- F(A) means “provably capable and motivated to solve Satan’s Box”, which is impossible to prove for any system. And if you did impossibly prove it, you certainly wouldn’t toss away what you had just proven to be a full solution!
A further objection: you depict humans happily using random generation and inductive checking, or evolutionary algorithms, or whatever, and imply that a Bayesian superintelligence with |- P(A)->F(A) cannot do this. Why shouldn’t a Bayesian superintelligence use such methods? These are probabilistic methods; in fact, distinctly Bayesian methods, generating hypotheses and discarding them as they are falsified. You suggest that:
If an AI can run arbitrary bits of code without first checking them for friendliness, then there is a non-zero probability that this AI will turn into an unfriendly AI at some point in the future. Clearly then, we can’t prove that such an AI will always be friendly, and thus the theorem no longer holds.
Emulating arbitrary predictive hypotheses is not the same as running arbitrary bits of code at root level. An AI with a provably inescapable Java sandbox, and cognitive processes which provably do not violate |- P(A)->F(A) regardless of which events they observe, might justly decide that it could safely execute arbitrary Turing machines in a sandbox to see if their outputs predicted Satan’s box. In fact, a provably Friendly AI (in my sense) is likely to be far safer than any human, in such straits. The humans - if they are throwing Vast amounts of computing power at an evolutionary algorithm in the hopes of predicting a box of complexity 2000 - will in fairly short order be devoured by unstoppable horrors erupting from their Vast computer. You don’t want to run that kind of code except in the inside of a Friendly superintelligence.
We know that humans using mathematics will not be able to prove for any program that it predicts Satan’s box; that is, we cannot prove |- P(A) for any system. (Or if we suppose that humans have thirty pounds of axioms and the box is only a twenty-pound theorem, we could readily prove some superintelligence would solve it, too.) We humans are therefore limited to probabilistic methods.
This means, in plain English, that there is no strategy, no course of action, no clever plan, even the plan of sitting down and thinking about the problem, which we can guarantee will deliver an answer.
The plan of building a Bayesian superintelligence, which we prove to be Friendly in the sense of |- P(A)->F(A), which FAI will then use its own probabilistic methods at a vastly higher level of skill and intelligence to try to solve the box, also cannot be guaranteed to work in the sense of guaranteeably producing a system with P(A) (and therefore F(A) by the implication above). But if Satan’s box is killing people, I’d go for the Bayesian superintelligence over any solution involving fiddly little evolutionary algorithms, or even a bunch of human mathematicians sitting in a room trying to grok Satan’s box.
It isn’t certain to work, but, as probabilistic methods go, the probabilistic method of building a nice clean trustworthy superintelligence, provably Friendly in the sense of |- P(A)->F(A), to use ultrapowerful probabilistic methods to solve Satan’s box, seems like a neater uncertain solution than any other that comes to mind. Assuming one can do it, which is a separate problem left as an exercise to the reader. The main point is that I don’t see why it would be mathematically impossible. What is it that human beings can get together in a basement and do, which a |- P(A)->F(A) superintelligence cannot? What do we lose by proving Friendliness, in my sense of that term?
8 Eliezer Yudkowsky // Sep 17, 2006 at 12:29 am
Actually, looking over your original definitions, I find that you have used your terms inconsistently, as follows:
Let F(A,x) mean that when facing situation x the AGI system A is friendly. As a minimal condition for friendliness, it won’t kill an infinite number of people if it can avoid doing so. If for all x, F(A,x) is true then we simply write F(A). Clearly, if there exists x such that ~F(A,x), then ~F(A).
Now you seem to be defining F(A, x) to mean the same condition I referred to as P(A, x)->F(A, x), that is, the AI won’t kill an infinite number of people if it can avoid doing so.
Later on, however, you say:
What about if the AGI A is very powerful? Then we get either possibility 2 or 3. In these two cases A is so powerful that it is able to learn to predict any sequence with K-dot complexity below 2000. It has the computational and algorithmic resources to do this. However, will it actually use these resources to predict the sequence in order to save the world? By theorem 7.1 here, even when a powerful predictor belongs to the set of powerful predictors able to predict all sequences with complexity below 2000 bits (and such algorithms can be proven to exist), it is impossible to prove that any specific algorithm actually has this property. Thus, for very powerful A, even if F(A,B) were true, this cannot be proven. As a proof of F(A) would imply F(A,B) by definition, the fact that F(A,B) is not provable any for powerful A, implies that F(A) is not provable for any powerful A.
You begin by supposing that A is very powerful. Then if your original, defined sense of F(A, B) is true - what I would call P(A)->F(A) - then A will in fact solve Satan’s box. Indeed it is impossible to prove that any A has this property. This is because it is impossible to prove P(A) even in cases where it is true. Your statement that “even if F(A, B) were true, this cannot be proven” simply does not follow as you originally defined the term. F(A, B) can be true in both your possibility “1. A is too weak to save the world” and your possibility “3. A is powerful enough to save the world and does so.” We cannot prove 3 for any A. But unless we are playing at constructive mathematics, perhaps we can show “A is in 1 or 3, not 2″ without being to show which of these is the case; this is the statement which you originally defined as F(A, B).
Again, you said:
Thus, for very powerful A, even if F(A,B) were true, this cannot be proven. As a proof of F(A) would imply F(A,B) by definition, the fact that F(A,B) is not provable any for powerful A, implies that F(A) is not provable for any powerful A.
This is not true; we can prove F(A, B) as you originally defined it for very powerful A; the fact we cannot prove is the fact that A is very powerful.
Your conclusion, “F(A) is not provable for any powerful A”, therefore does not follow.
I therefore withdraw my statement that you had interpreted incorrectly a correct theorem. If I go by your original definitions above, it seems to me that you have asserted a false conclusion; due to losing track of what is provable, versus that which you have assumed true but not assumed provable within the system.
9 mathemajician // Sep 17, 2006 at 2:49 pm
I’ll skip ahead to the second comment as that’s the important one:
The theorem of Friendly AI I want to prove takes the form |- \x.P(A, x)->F(A, x)
Right, however I’d prefer to say S for “saves the world” rather than having another predicate F which is different to the predicate F that I use.
“1. A is too weak to save the world” and your possibility “3. A is powerful enough to save the world and does so.” We cannot prove 3 for any A. But unless we are playing at constructive mathematics, perhaps we can show “A is in 1 or 3, not 2″ without being to show which of these is the case;
Yes! I thought about this a bit and unless we are both mistaken, this is indeed a problem. Unless somebody else has a different opinion I think we can consider this first version of the proof flawed. Thus, in the absence of a formal proof to the contrary, it seems that the question about whether friendliness can be proven for arbitrarily powerful AIs remains open.
I continue to suspect that proving the friendliness of arbitrarily powerful AIs is impossible. My intuition, which I think Ben shares, is that once systems become extremely complex proving any non-trivial property about them is most likely impossible. Naturally I challenge you to prove otherwise. Even just a completely formal definition of what “friendly” means for an AI would be a good start. Until such a definition exists I can’t see friendly AI getting very far.
Thinking about some of the things you wrote has also made me wonder if there are easier ways to show that friendly AIs are limited. In particular, I’m concerned about whether they can safely run arbitrary code:
Emulating arbitrary predictive hypotheses is not the same as running arbitrary bits of code at root level. An AI with a provably inescapable Java sandbox, and cognitive processes which provably do not violate |- P(A)->F(A) regardless of which events they observe, might justly decide that it could safely execute arbitrary Turing machines in a sandbox to see if their outputs predicted Satan’s box.
To me it sounds like a strong statement to say that you can build a 100% safe sandbox where the thing inside the box is still able to have some effect on the state of the world outside the box. Can you prove that such a box is possible? If it isn’t, then this would seem to limit friendly AIs.
10 Nick Hay // Sep 18, 2006 at 3:11 am
I continue to suspect that proving the friendliness of arbitrarily
powerful AIs is impossible. My intuition, which I think Ben shares, is
that once systems become extremely complex proving any non-trivial
property about them is most likely impossible.
Why do you suppose a Friendly AI must be extremely complex? Given that
we do not require it to predict *all* sequences with K(x) < n, which
we can't do anyway for nontrivial n. A Friendly AI may be
(relatively) simple in structure. For example, the complexity of the
prediction mechanisms may be irrelevant to a proof that the AI always
tries to be Friendly when it can predict things. (Note that its
structure may be more than 1000 bits complex: Chaitin's barrier
doesn't apply here as we don't prove things about K(x).)
Naturally I challenge you to prove otherwise. Even just a
completely formal definition of what “friendly” means for an AI would
be a good start. Until such a definition exists I can’t see friendly
AI getting very far.
Absolutely, you have to define the Friendliness property before you
can construct systems which provably satisfy it. However, this isn’t
necessarily a simple thing to do! It may require years of work.
Therefore, lack of an immediately (where “immediately” means “without
having spent years of work”) obvious definition is not evidence a
definition will not be found if looked for.
As such, saying “until such a definition [of Friendly AI] exists I
can’t see [F]riendly AI getting very far” is analogous saying “until
such a definition of quantum gravity exists I can’t see quantum
gravity getting very far” — a major part of Friendly AI research is
finding such a precise definition, as with quantum gravity.
11 mathemajician // Sep 18, 2006 at 12:22 pm
Why do you suppose a Friendly AI must be extremely complex?
I don’t, the key bit was arbitrarily powerful AIs. If an AI is not too amazingly powerful then it won’t hit up against the kind of complexity barriers that I describe in my universal prediction paper. In which case, maybe the AI could be simple and proving things about it not too hard. However I’m reluctant to predict that any super intelligent AI will stay below any given level of performance for very long.
For example, the complexity of the prediction mechanisms may be irrelevant to a proof that the AI always tries to be Friendly when it can predict things.
I think this is where the heart of the matter lies. For example, if the AI has a prediction sub-system that is extremely complex then it may not be able to prove much about this. In which case it might not be able to use it because it can’t know whether or not this code is friendly. Perhaps a negative result along these lines is possible?
Absolutely, you have to define the Friendliness property before you
can construct systems which provably satisfy it. However, this isn’t
necessarily a simple thing to do! It may require years of work.
But how many years do we have? I’m an AI “optimist” in that I believe that strong AI could come about in the next 15 years. Even if the problem were formally solvable, I really can’t see it happening in the next 50 years. I think we need a “plan B”.
12 Nick Hay // Sep 18, 2006 at 1:29 pm
Why do you suppose a Friendly AI must be extremely complex?
I don’t, the key bit was arbitrarily powerful AIs.
If “arbitrarily powerful” means “can predict all sequences simpler than a high upper bound” then your theorem shows this. No objection here.
I wonder if pragmatically very powerful predictors can be simple e.g. ones that cannot predict *all* sequences below a given sequence just most, or just the ones that actually happen to come up in reality. This this isn’t so important, though.
For example, if the AI has a prediction sub-system that is extremely complex then it may not be able to prove much about this. In which case it might not be able to use it because it can’t know whether or not this code is friendly. Perhaps a negative result along these lines is possible?
Ok, this sandboxing question is interesting, although I’m not sure it’s the right question. You may be able get around this by proving universal statements — inferring that this particular predictor is safe because all predictors are. This may require the FAI to have a certain predictable reliably (something humans lack).
On the other hand, code maliciously designed with sufficient smartness (or chance) may be able to manipulate reality through side effects of simply existing (e.g. manipulating its computations to send out appropriately patterned EM waves, say). You may have to live with a probabilistic statement: if the predictor is not maliciously generated by something sufficiently smart then it’s likely safe.
Or maybe there’s a smart solution I didn’t think of. There should be some interesting theorems in this area.
Absolutely, you have to define the Friendliness property before you
can construct systems which provably satisfy it. However, this isn’t
necessarily a simple thing to do! It may require years of work.
But how many years do we have? I’m an AI “optimist” in that I believe that strong AI could come about in the next 15 years. Even if the problem were formally solvable, I really can’t see it happening in the next 50 years. I think we need a “plan B”.
I’m an AI “optimist” too. On the other hand, I suspect most AIs will be fatally indifferent to us. As a result selecting designs by chance will definitely fail, with high probability. So we need a strong assurance before writing an AI.
I’m not as much of a FAI “pessimist” as you, but I’m not sure this makes a difference. It’s pragmatically better to be still working on an assurance of Friendliness when someone else destroys the world, than to destroy the world because you didn’t work on the assurance.
Certain kinds of proof of correctness (maybe not as strong as those mentioned above), along with deep theoretical analysis, seem like necessary conditions to assure better than chance odds of success.
Did you read Eli’s link http://sl4.org/wiki/KnowabilityOfFAI ? It is large, so it may take time for you to find the time, but it’s well worth it. The idea that some kind of proof will be necessary because of human fallibility is explained in far more detail there, along with a bunch of other important stuff.
13 Eliezer Yudkowsky // Sep 18, 2006 at 6:16 pm
The theorem of Friendly AI I want to prove takes the form |- \x.P(A, x)->F(A, x)
Right, however I’d prefer to say S for “saves the world” rather than having another predicate F which is different to the predicate F that I use.
My apologies! At the time I started writing P(x)->F(x), I thought I was using F the way you used it. I didn’t go back and check your stated definition, which is a silly thing to miss in a mathematical argument; so again, apologies. If there’s need for further discussion I will use S(x) henceforth.
We cannot prove 3 for any A. But unless we are playing at constructive mathematics, perhaps we can show “A is in 1 or 3, not 2″ without being to show which of these is the case.
Yes! I thought about this a bit and unless we are both mistaken, this is indeed a problem. Unless somebody else has a different opinion I think we can consider this first version of the proof flawed. Thus, in the absence of a formal proof to the contrary, it seems that the question about whether friendliness can be proven for arbitrarily powerful AIs remains open.
I think we’ve reached agreement then, about both the math and its interpretation.
I note for the record that while my discourse upon Godel was correct taken in isolation, it was misdirected (Shane was not trying to say what I thought he was).
Do you want to start a new blog post called “Post-Mortem” or “Trying Again” or some such, for discussing your request for formalized Friendliness so you can prove it impossible? I do have some suggestions about simplifying the problem from your perspective. For example, if you could prove that we cannot prove that any self-modifying AI tries to transform the universe into paperclips, that would be very strong evidence that any more complex definition of Friendliness was also impossible to prove. Or to make the definition even simpler: The utility function is proportional to the number of nickel atoms existing over time, with a smooth exponential discount of 0.999/year.
However I should not need to do all your work for you - you claim to already have an intuition that Friendly AI is impossible. Not an intuition that I’m talking too vaguely for you to know whether it’s possible or not, but an intuition that there is some specific challenge which is unsolvable. So if your intuition is nonbogus, you ought to be able to verbalize your intuition of what exactly you think is impossible. Perhaps by asking yourself various questions of the type “Consider X: do I think it is impossible? Do I think it is possible but impossible to prove?” By considering how you make the judgment, you might also be able to expose the source of your intuitions, and see if you could turn them into a proof.
I continue to suspect that proving the friendliness of arbitrarily powerful AIs is impossible. My intuition, which I think Ben shares, is that once systems become extremely complex proving any non-trivial property about them is most likely impossible. Naturally I challenge you to prove otherwise. Even just a completely formal definition of what “friendly” means for an AI would be a good start. Until such a definition exists I can’t see friendly AI getting very far.
I accept the burden of proof you have placed on me. (I note that a similar burden of proof rests on Ben for “compassion”, “choice”, and “growth”.)
I think that someone who reads my publications will find themselves thinking much more precisely about Friendly AI afterward; they will be closer to formality. But not yet formal. It is not a problem I am finished solving.
In AI, “formal” is not very far from “code”. Not uncommonly, by the time you have a correct formal definition, you’re essentially done. I think that a formal definition of Friendly, literally formal in the mathematical sense, might have to be so specific about how (a) the AI processes sensory information into a model, (b) identifies goal-valent quantities in the model, (c) generates choices, and (d) self-modifies, that if you had a formal definition of “Friendly” it might compile and run. (Albeit that without further optimization it might require an unphysically large computer.) I offer this as justification for not yet having a literally formal definition of Friendly; it depends on other AGI problems and by the time I had a formal definition, my job might be 90% done. Similarly, if you want to come up with a neat proof of impossibility, the harder part of your own job is to precisely define that which you intuitively believe to be impossible.
14 mathemajician // Sep 21, 2006 at 9:59 pm
Yes, I’m going to write a follow-up blog entry. For the moment I’m a bit too busy as I’m putting the finishing touches on a 50 page journal paper that I’m writing with Hutter. In a few days I’ll be back in action and will reply to some of your comments above as well.
15 pvgoddijn // May 22, 2008 at 4:25 pm
another small problem is your minimal definition of friendlyness.
its friendly if it wont let the box kill every human.
by letting it eat a single human each time a button is pushed, 1 out of infinite human get eaten. ie just pressing the button at will is enough to satisfy the friendlyness constaint (becasue its eat one person at a time it will never eat the last one).
thus the only way to fail this constraint is to not push the button.
constructing an ‘ai’ that presses a button is provable i think . so there exist at least a subset provale friendly ai’s
16 Shane Legg // May 22, 2008 at 7:21 pm
I hate how the italics carries over from one comment to the next. Anyway…
pvgoddijn: An AGI that could work out how to press the correct button and keep the number of deaths finite but instead allows an unbounded number of people to die is “friendly”? I certainly don’t agree with your concept of friendliness!
Leave a Comment