Patrick McDonald: Alternative Thought Experiments for AI Safety

One difficulty in convincing someone to care about AI safety is that many of the field’s concerns pertain to technological advancements that haven’t yet occurred. As such, the thought experiment can be a powerful means of communicating these concepts – it allows safety principles to be shared in the abstract, without the need for real-world implementation. (Thought experiments are especially handy because, by the time AI powerful enough for alignment to be a concern actually exists, it may be too late to start brainstorming alignment solutions.) Especially when presented to an audience that is new to and/or skeptical of a cause area like AI safety, though, these explanatory devices must be crafted carefully so as not to give the wrong idea.

Take, for instance, the problem of value alignment. Bostrom’s paperclip maximizer is a compelling example that describes how an AI with a seemingly harmless goal – to make as many paperclips as possible – could spell the end of humanity. The upshot of this thought experiment is that, absent any instruction besides the imperative to make paperclips, an AI may stop at nothing to optimize its paperclip-making mechanisms, even to the extent that it absorbs into its operations humans, their resources, and their entire corner of the universe and beyond. Averting such a disaster would require aligning the AI’s values with those widely shared among humans (including, but certainly not limited to, the preservation of humanity as a high priority).

Crucial to this thought experiment, though, is the assumption that intelligence entails neither humanoid sentience nor a “common sense” grasp and prioritization of human-centric values. This assumption isn’t obvious, so a listener who isn’t keyed in on it may well conflate general intelligence with morality. In that case, the paperclip maximizer may register simply as a machine intelligence “turning evil” and destroying the world out of ill will, which doesn’t shed much light on the value alignment problem. That is, instead of the onus being on people to properly align the AI’s values, it feels like the AI’s fault for acting “immorally” – despite its having no concept of “moral behavior” in the first place.

A more deliberate, educationally paced walkthrough of Bostrom’s ideas – one that speaks to the relevant issues without introducing the confusion described above – is apparent in the real-world instance of the clean-up crows of France’s Puy du Fou park. According to Smithsonian Magazine, “Staff at Puy du Fou park… [had] taught six rooks that were raised in captivity to pick up pieces of garbage and place them inside a box that releases a treat each time rubbish is deposited.” (Rooks are members of the crow family.) The system largely worked, except for a few instances in which the crows “would try to trick [the trainer] by dropping pieces of wood, instead of garbage, into the box.”

In this case, the people at Puy du Fou had identified a problem: trash around the park. They had delegated problem-solving responsibilities to non-human intelligences: crows. For this delegation of responsibility to work, though, it was up to the people of Puy du Fou to define a goal state that would motivate the crows to complete the task: a snack, dispensed each time a crow brought trash to the designated box. But just as an AI’s value system is not necessarily humanoid, neither is a crow’s: to a crow, it’s not clear by default exactly what is trash and what is not. So unintended consequences arose: the crows fulfilled the goal state established by the people without solving the intended problem – and potentially created even more of a problem, depending on what they mistook for trash and picked up. This isn’t to say that the crows were acting maliciously; they don’t know any better than to reach their goal state. But it was crucial for the people to define the trash-collecting problem such that they saw their values reflected not just in the crows’ realization of the solution, but also in how the crows went about realizing it.
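To make the structure of this misalignment concrete, here is a minimal sketch in Python. It is purely illustrative: the item names, trash labels, and reward functions below are invented for this example, not drawn from the Puy du Fou system. The sketch contrasts the reward that was actually specified (a treat per item deposited) with the objective that was intended (genuine trash removed):

```python
# Toy illustration of the Puy du Fou dynamic: the reward the trainers
# specified (a treat per deposit) is not the objective they intended
# (a treat per piece of genuine trash removed). All names and values
# here are invented for this example.

ITEMS = ["cigarette_butt", "wood_chip", "bottle_cap", "twig"]
IS_TRASH = {"cigarette_butt": True, "wood_chip": False,
            "bottle_cap": True, "twig": False}

def specified_reward(deposited_items):
    """What the box actually pays out: one treat per item deposited."""
    return len(deposited_items)

def intended_reward(deposited_items):
    """What the trainers actually wanted: one treat per piece of real trash."""
    return sum(1 for item in deposited_items if IS_TRASH[item])

# A "crow" optimizing the specified reward has no reason to distinguish
# trash from wood: everything within reach gets deposited.
deposits = list(ITEMS)
print("treats earned (specified reward):", specified_reward(deposits))   # 4
print("trash removed (intended goal):", intended_reward(deposits))       # 2
```

An agent that only optimizes specified_reward has no reason to distinguish wood chips from cigarette butts. Closing that gap between the specified and intended objectives is, in miniature, the value alignment problem.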

The same principles described in the crow example apply to defining a problem and corresponding goal for a properly aligned AI. Crucially, though, the crow example underscores the key assumption that intelligence entails neither humanoid sentience nor a “common sense” understanding of human values; instead of positioning a superintelligence as the actor – which may be incorrectly assumed to have both sentience and a grasp on human morality due to its cognitive ability – the crow is used. Since the crow clearly has neither humanoid sentience nor a sense of human morality, this example provides an analogy for alignment challenges without introducing the potentially confusing variables of sentience and moral sense.

To be sure, crows are not superintelligent. Nor is a crow trying to dispose of something that’s not trash a cataclysmic failure mode. But with another logical step, this example of the Puy du Fou crows can be developed from a mere analogy for value misalignment into a warning that the same dynamic might unfold between humans and a far more intelligent actor. We’ll couple our knowledge of real-world value misalignment (however modest it may be, as with the Puy du Fou crows) with an assumption that the Orthogonality Thesis holds – that is, that an agent’s level of intelligence and its goals (moral sense included) can vary independently of one another. The result? As we delegate to smarter and smarter non-human intelligences and/or deploy them on larger and larger human problems, the same sort of misalignment may occur as with the crows. Only in this case, the consequences would be proportionally more dire, up to and including the apocalyptic repercussions of Bostrom’s paperclip maximizer.

The story of the Puy du Fou crows isn’t meant to replace Bostrom’s paperclip maximizer. Rather, it’s meant to spell out the paperclip maximizer’s underlying assumptions in a way that reduces sci-fi-inflected confusion for those new to and/or skeptical of the cause area. Hopefully, thought experiments like these somewhat demystify the dangers of unchecked AI development, grounding as many principles as possible in the present dynamics of nature and society rather than in hypotheticals.