Friendly AI is bunk

by Shane Legg

One of my recent posts to my old blog has generated quite a bit of interest. As I wasn’t able to reply to the comments due to the blog being broken, I’m going to repost it here along with the comments.

In case the title of the post gives the wrong impression, I’d also like to clarify that I think that trying to build AIs so that they don’t do nasty things is obviously a good idea. My problem is that I think that the current efforts are woefully inadequate. The only way to seriously deal with this problem would be to mathematically define “friendliness” and prove that certain AI architectures would always remain friendly. I don’t think anybody has ever managed to come remotely close to doing this, and I suspect that nobody ever will. Even worse, I suspect that the only stable long-term possibility is a super AGI that is primarily interested in its own self-preservation. All other possibilities are critically unstable due to evolutionary pressure.

One other thing. The most active place on the internet for discussing Friendly AI is the SL4 email list. Ironically, it must be one of the most hostile email lists on the internet with frequent flame wars and people being banned from the list. The moderation system consists of so-called “list snipers” whose job it is to ban discussions that they don’t like. If these people are experts in friendliness… lord help us.

Anyway, here’s my original post:

Yesterday I went down to Genoa in Italy to meet up with Ben and Izabela Goertzel and talk AI. One of the things we discussed was friendly AI, that is, the idea that you can build AIs in such a way that they will not do nasty things, like killing off the human race. I think the idea is an impossible dream, and it seems to me that both Ben and Izabela are pretty skeptical of it too, though Ben at least attempted to put up the other side of the argument. I thought I might try to list my main arguments against the possibility of friendly AI here.

Tough love or killing us with kindness. Does “friendliness” have any meaning? If a super AI decided to start making all our wishes come true, might we just end up killing ourselves or at least becoming very unhappy? We’ve all heard stories of people who have won $50 million in a lottery and then years later claim that it destroyed their lives. Alternatively, perhaps a super AI might do something that seems extremely bad, like killing off billions of people, but only later, in the long run, we realise that this was in fact the friendliest thing for it to do. A bit like how your father didn’t allow you to do something as a child. At the time you didn’t think he was being very nice to you, but years later you understand and are thankful for what he did as you realise that it was in your own best interests. If seemingly terrible things can be really good, and seemingly wonderful things can be really bad, how could anybody figure out what is or is not a friendly action? Even with hindsight people still can’t agree on whether certain things in history were good or bad. Usually things are good in some ways and for some people, but bad in others.

Deadly butterfly wings. We all know the idea from chaos theory that a single flap of the wings of a butterfly could cause a hurricane a few weeks later. In which case, if an AI did some trivial act, surely that could trigger a terrible event some time later? As even a super AI’s powers are limited, it might not realise this, in which case, was it being friendly or not? Or is friendliness the intent to be friendly, not what actually ends up happening? In the latter case, are fanatics being friendly when they do nasty stuff because they are really just trying their best to save the world?
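
To make the butterfly idea concrete, here is a minimal sketch of sensitive dependence on initial conditions. It uses the logistic map, a standard toy example of chaos that is my choice for illustration and not something from the discussion above; the function name, the starting value and the size of the nudge are all assumptions.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x) and return every state visited."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs that differ only by a tiny "flap" in the starting state.
baseline = logistic_orbit(0.2, 60)
nudged = logistic_orbit(0.2 + 1e-10, 60)

# Distance between the two runs at each step.
gaps = [abs(a - b) for a, b in zip(baseline, nudged)]

print("gap at step 0: ", gaps[0])
print("gap at step 5: ", gaps[5])
print("largest gap:   ", max(gaps))
```

The two runs stay close for the first few steps and then drift apart until the gap is as large as the values themselves. An agent that can only forecast a handful of steps ahead would have no way of telling which of the two futures its trivial act had selected.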

Beautiful tool, terrible owners. Even if an AI didn’t have the motivation to do nasty stuff itself, it might well have owners with screwed up ideas. As they say, “power corrupts and absolute power corrupts absolutely”.

Evil in disguise. A super AI might invent a new drug to cure a terrible disease, knowing full well that within a few years of this new drug coming out somebody will discover closely related technology that will spell almost certain doom for the human race. We blame some crazy scientist for killing off the human race, but in fact the process was actually set off by a very sneaky super AI. It didn’t need to lift a finger; it just published a short research paper and sat back and waited for people to do the rest.

The provably unprovable. As I showed in a recent paper, while extremely powerful prediction algorithms exist that can predict all sequences up to any given Kolmogorov complexity, you can’t actually prove this for any specific predictor beyond a Kolmogorov complexity of about 1,000 bits. So let’s say you have a super AI and it contains one of these amazing ultimate 2,000 bit predictors. Every day some problem comes along where the AI has to predict some 1,500 bit complexity sequence in order to save the world. Will your AI save the world every day? In other words, is it friendly? Can you prove that it’s friendly? No, you can’t, because if you could then you would have in effect proven that the AI could predict any sequence up to a Kolmogorov complexity of 1,500 bits, and that’s impossible. Thus for this AI system you can’t prove that it’s friendly, even if it is. The same goes for even more powerful AIs that can predict all sequences up to a Kolmogorov complexity of 3,000 or 100,000 bits, and so on. Thus, if you can prove the friendliness of an AI system then the power of this AI must be below the 1,000 bit bound.

Of all these things I still think the first is the biggest problem. What is friendly? I can have an idea of what “friendly” means for other people and the things they do in my life. But in the context of a super intelligent machine, the whole concept breaks down. If I can’t define or measure something, I can’t say anything solid about it.