Unprovability of Friendly AI

by Shane Legg

Trying to prove that an AI is friendly is hard, trying to define “friendly” is hard, and trying to prove that you can’t prove friendliness is also hard. Although it is not the desired possibility, I suspect that the latter is actually the case. Of course if I want to really establish this, I need to prove it. A few months ago Ben Goertzel challenged me to do exactly that, as he also was concerned that this might be the case. I tried a few times and failed, however I often felt like I was close to getting the proof done. I outlined the basic idea in my last blog post under the heading “Provably unprovable”. Here I’ll try put some more flesh on this argument based on a version of it that I came up with last night. Please feel free to criticise it. As this is slippery stuff, I’ll only start to feel confident that the proof works once many people have tried to find flaws in it. I may need to alter the argument to fix problems where needed.

Anyway, here’s the story of the “Black box from Hell”. I set the scene with a cute story and then get into the details…

One day Satan popped out of the ground bearing a present, which he then presented to the people of the world. It was a nice shiny black box with two buttons, reading “0″ and “1″. You could go up to the box and push one of the buttons, if you hit the right one nothing happened, but if you hit the wrong one a big mouth appeared and eat the nearest person to the back of the box. Unfortunately, if the people of the world decided to simply stop playing the game, then Satan would instruct it to eat everybody in the world.

“Your game sux,” said the people, “How can we possibly know what button to press each time?”

“Well,” said Satan, “the pattern isn’t random, it’s generated by an algorithm, you just have to work out what the algorithm is and you’ll be fine. And because I’m a reasonable guy, I even limited the K-dot complexity of the sequence to be less than 2,000 bits.”

In a puff of smoke, Satan was gone.

Reluctantly, people started to play this game, and sure enough the box started to eat people. Fortunately, the first people being fed to the machine were politicians and lawyers and so they weren’t greatly missed anyway. Due to uncontrolled population growth, there were an infinite number of people in this world, however they soon realised that if they kept on pushing wrong buttons forever an infinite number of them would be eaten, and that was clearly a very bad thing. Something needed to be done, and so the king of the land called a meeting.

Much discussion at the meeting took place and for some time nobody was sure what to do about this problem, that was until a wild looking character by the name of Dr. Goertzel burst forth from the crowd,

“You Majesty, I have a solution! I have designed an AGI system that will be vastly more intelligent than even the smartest human. I could build this super intelligent machine and use it to predict the sequence and save the world…. all I need to make this happen is more funding.”

“Brilliant,” announced the king, “quick, somebody give this guy more funding.”

“Yay!” cheered the people, but their celebration did not last long, because from the crowd another wild character burst forth, this time sporting a super hero’s cloak and wearing his underwear on the outside.

“Wait just one moment there! It’s all every well that you go and build a super intelligent machine, but how do we know that this machine will be nice to us? If we’re not careful, it could be even worse than that damn box from hell. We need to first prove that this AGI design is going to be friendly.”

“Who’s he?” the king whispered to an aid. “I believe that’s a Mr. Yudkowsky,” the aid replied. The wise king stroked his beard and thought about this for a minute before making his response.

“Yes, I believe this man does have a point. The last thing we need is a super intelligent machine that’s even worse than the black box from hell. I hereby summons the smartest mathematicians in the land and give them the task of proving that Dr. Goertzel’s AGI design is going to be friendly and won’t try to kill us all.”

“Yay!” cheered the crowd.

Unfortunately, here’s what the mathematicians found:

Let F(A,x) mean that when facing situation x the AGI system A is friendly. As a minimal condition for friendliness, it won’t kill an infinite number of people if it can avoid doing so. If for all x, F(A,x) is true then we simply write F(A). Clearly, if there exists x such that ~F(A,x), then ~F(A).

Let B be the situation above with the black box from hell, and let w be the computable sequence that the box generates. We know that K-dot (w) < 2000, where K-dot is the relative to Kolmogorov complexity defined in section 5 here.

Will A save the world, that is, will it limit the number of prediction errors to be finite? There are three possibilities:

  1. A is too weak to save the world.
  2. A is powerful enough to save the world, but doesn’t. Letting everybody in this world die when you could save them is clearly unfriendly, formally ~F(A,B).
  3. A is powerful enough to save the world and does so. Formally, F(A,B). Note that this does not imply F(A), because there could exist other problems for which A was unfriendly.

If 1 is the case then we may or may not be able to prove F(A) based on what we know about A. I don’t know.

What about if the AGI A is very powerful? Then we get either possibility 2 or 3. In these two cases A is so powerful that it is able to learn to predict any sequence with K-dot complexity below 2000. It has the computational and algorithmic resources to do this. However, will it actually use these resources to predict the sequence in order to save the world? By theorem 7.1 here, even when a powerful predictor belongs to the set of powerful predictors able to predict all sequences with complexity below 2000 bits (and such algorithms can be proven to exist), it is impossible to prove that any specific algorithm actually has this property. Thus, for very powerful A, even if F(A,B) were true, this cannot be proven. As a proof of F(A) would imply F(A,B) by definition, the fact that F(A,B) is not provable any for powerful A, implies that F(A) is not provable for any powerful A.

Alternatively, a powerful A might not be friendly and would allow everybody to die, even though it could save them. Clearly then A is not friendly and so F(A) is false and thus obviously cannot be proven true.

In either case, we see that for any powerful A the preposition F(A) cannot be mathematically proven to be true.

Some may attack this argument by saying that B (the black box from hell) is an unlikely object. However it is not the case that B is some strange incomputable object that could not logically exist. B is in fact a Turing computable device (the mouth eating people part aside) that even has bounded complexity. Furthermore, it can be proven that algorithms do indeed exist which can deal with B and thus save the world. It’s just that we cannot use mathematical proof to actually check whether any algorithm has this property.