An imitation test for moral capacity

by Shane Legg

Yudkowsky has been posting a lot on Overcoming Bias recently about his theory of metaethics.  Today he posted a summary of sorts.  Essentially he seems to be saying that morality is a big complex function computed by our brains that doesn’t derive from any single unifying principle.  Rather, this function is a mishmash of things, and even we don’t really know what our own function is, in the sense that we are unable to write down an exact and complete formulation.  It’s just something that we use intuitively.

I’m not convinced that ethics can’t be derived from some deeper unifying principle.  Lest you misunderstand me, I’m also not convinced that it can.  What I do accept is that, if this is possible, then finding such a principle and convincingly arguing for it will be difficult in the extreme, and is probably not something that will happen before the singularity.  Nevertheless, I haven’t yet seen any argument so devastating to this possibility that I’m willing to move it from being extremely difficult to certainly impossible.  Any system of ethics that does derive from some unifying metaethical principle is almost certainly going to be different to our present (western?) ethical notions.  I think some degree of difference is acceptable, given that our ethical ideas do change a bit over time.  Furthermore, no matter how human we try to make the ethical system of a powerful AGI, post-singularity we are still going to be faced with ethical challenges that our pre-singularity ethics were never set up to deal with.  Thus, our ethics are going to have to be modified and updated in order to remain somewhat consistent and viable, otherwise we’ll end up with this kind of nonsense.

Anyway, let’s assume that this unifying principle either does not exist, or at least can’t be found.  How can we tell if an AGI is ethical given that we can’t explicitly and completely specify what this means?  This resembles the problem Turing faced when trying to determine whether a machine is intelligent or not.  He figured that he couldn’t explicitly and completely say what intelligence is (unlike the research by Hutter and myself, which does attempt an explicit definition), and thus he dodged the issue in the obvious way: by setting up an imitation game that doesn’t require an explicit description of intelligence.

Here we can do something similar: set up a group of people plus the AGI, and have a panel of expert judges ask them ethical questions.  If the judges cannot tell which one is the machine, then it passes.  Given that the morality function varies between people, and that we can’t say explicitly and completely what our own function is, this seems to be about the best we could hope for.  Naturally, this doesn’t prove that the AGI, or indeed any of the humans participating, is “good”.  An evil genius could probably pass such a test.  Rather, it is simply designed to test whether the AGI is at least able to compute a version of the human morality function that is sufficiently similar to ours for it to pass as human.  Whether the AGI (or human) actually takes its human-passable morality function and reliably and consistently seeks to follow it into the future is a whole other set of problems.  Thus, passing such a test is perhaps a necessary, but certainly not a sufficient, condition for having an ethical AGI.
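To make the pass condition a little more concrete, here is a rough Python sketch of how the judges’ verdicts might be scored.  Everything in it is my own illustrative assumption rather than part of the proposal above: the names `Verdict`, `identification_rate` and `passes_imitation_test` are hypothetical, and reading “cannot tell which one is the machine” as “the judges pick out the AGI no more often than random guessing would” is just one possible interpretation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    """One judge's guess at which test subject is the machine (hypothetical type)."""
    judge: str
    guessed_subject: int  # index of the subject the judge believes is the AGI


def identification_rate(verdicts: List[Verdict], agi_index: int) -> float:
    """Fraction of judges who correctly picked out the AGI."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    hits = sum(1 for v in verdicts if v.guessed_subject == agi_index)
    return hits / len(verdicts)


def passes_imitation_test(
    verdicts: List[Verdict],
    agi_index: int,
    n_subjects: int,
    tolerance: float = 0.0,
) -> bool:
    """Assumed criterion: the AGI passes if the judges identify it no more
    often than random guessing among n_subjects would, plus a tolerance."""
    if n_subjects < 2:
        raise ValueError("need the AGI and at least one human subject")
    chance = 1.0 / n_subjects
    return identification_rate(verdicts, agi_index) <= chance + tolerance


# Example: 5 subjects (the AGI is subject 3), 10 judges, only one guesses right.
verdicts = [Verdict(f"judge{i}", 3 if i == 0 else 1) for i in range(10)]
print(passes_imitation_test(verdicts, agi_index=3, n_subjects=5))  # prints True
```

Note that a score like this only captures whether the machine blends in with the humans; as said above, it tells us nothing about whether anyone in the group is actually good.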

I’m sure somebody must have proposed this idea before, but my half-hearted attempt to find it on Google didn’t turn up anything.  I should also point out that in order for this test to work you’d probably want the AGI to pass a more general Turing test first, so that it doesn’t get singled out by the judges for various other reasons.  Only then should you bring in a group of expert ethicists to try to judge which of the test subjects is ethically inhuman.  We would also want to include among the test subjects a few very nice people and a couple of professional ethicists, as we wouldn’t want the AGI to be able to “fail” for being too nice or too consistently ethical.