Could AI be a risk to humans? Well it already is:

  • Elaine Herzberg was killed by an Uber self-driving car, while walking her bike across a pedestrian crossing. The system couldn’t decide if she was a bike or a person, and the switching between these two possibilities confused it. Uber had disabled the Volvo automatic braking system. (It was slowing them down.)

  • About one in 100 robot surgeries involves an accident; about 20% of these are what we’d call AI failures (the system turning on or off at the wrong moment, or misinterpreting what it sees).

  • Consider also things like the Ziyan Blowfish, an autonomous Chinese military drone currently under export to the Middle East.

  • Here’s a list of other relatively bad cases, and here’s a (melodramatic but largely valid) look at the power and scale of our existing automation of surveillance and influence.

Harm through intelligence

These systems did harm because they were too stupid to do what we asked (or because the humans deploying them were).

What about a system harming us because it is too smart? Is there any real chance that advanced AI could ruin us?

Argument from caution

We don’t know. They don’t exist, so we can’t study them and work it out. Here’s an argument for worrying, even so:

  1. It’s likely we will make a general AI (AGI) eventually.
  2. We don’t know when.
  3. We don’t know if it will be dangerous.
  4. We don’t know how hard it is to make safe.
  5. Not many people are working on this. (<500)
  6. So it’s probably worth working on.

In particular, I claim that your starting guess for P(soon & dangerous & difficult) should be at least 3%. I just put a number on this unknown risk. How?

Well, we surveyed 350 mainstream AI researchers in 2017.

  • Their median P of AGI within a century: 75%
  • Their median P of “extremely bad” outcome (human extinction, loss of governance, or worse): 5%
  • Their median P of safety being as hard or harder than capabilities: 75%

If we illicitly multiply these, we get a prior of a 3% chance of catastrophic AGI this century. 1
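The multiplication can be spelled out as a minimal sketch. The variable names are mine, and treating the three medians as independent probabilities is exactly the "illicit" step:

```python
# Survey medians quoted above, written as probabilities.
p_agi_this_century = 0.75   # AGI within a century
p_extremely_bad = 0.05      # "extremely bad" outcome
p_safety_hard = 0.75        # safety as hard or harder than capabilities

# Assumption: the three quantities are independent, so they can be multiplied.
prior = p_agi_this_century * p_extremely_bad * p_safety_hard
print(f"{prior:.1%}")  # about 2.8%, which rounds to 3%
```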

Now, this is weak evidence! AI researchers are notoriously bad at predicting AI; they’re probably biased in lots of ways (e.g. biased against the idea that what they’re working on could be morally wrong; e.g. biased in favour of AGI being soon).

But you should go with 3% until you think about it more than them. 2

Some of the shorter-run predictions from that survey have turned out to be fairly accurate.

3% is small!

Not really. It's the probability of 5 coin flips all coming up heads (1/32). Or more pertinently, it's the probability of dying when playing Russian roulette with 1 bullet in just 1 of 6 six-chamber revolvers (1/36).
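A quick check of both comparisons, as a sketch. It assumes fair coins and six revolvers with six chambers each, which the text does not state:

```python
from fractions import Fraction

# Five fair coin flips all coming up heads.
five_heads = Fraction(1, 2) ** 5

# Assumption: six six-chamber revolvers, one bullet in total.
# You pick a gun at random (1/6), then fire one random chamber (1/6).
roulette = Fraction(1, 6) * Fraction(1, 6)

print(five_heads, float(five_heads))
print(roulette, float(roulette))
```

Both come out close to the 3% figure, one slightly above and one slightly below.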

It's also roughly the same as the probability of extreme climate change, which we tend to care about a lot. Probabilities don't lead to decisions on their own; you need to look at the payoff, which here is very large.

High uncertainty is not low probability

The weakness of the evidence means we remain very uncertain - it could be 0.1% to 90%. But this is even worse when you think about it. If you are genuinely uncertain about whether there's a landmine in front of you, you don't step forward.
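One way to see why a wide range is worse, under an assumption that is mine and not the survey's: suppose your credence is spread evenly on a log scale between 0.1% and 90%. The sketch below computes the median and mean of that log-uniform distribution.

```python
import math

low, high = 0.001, 0.90  # the range quoted above

# Assumption: credence is log-uniform over [low, high].
# Median of a log-uniform distribution is the geometric mean of the bounds.
median_p = math.sqrt(low * high)
# Mean of a log-uniform distribution is (high - low) / ln(high / low).
mean_p = (high - low) / math.log(high / low)

print(f"median {median_p:.1%}, mean {mean_p:.1%}")
```

Under this assumption the median stays near 3% but the mean is roughly 13%, because the high end of the range dominates the average.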

Against the null prior

People often act as if "things should be treated as 0 probability until we see hard evidence - peer-reviewed evidence".

The last year of government failure on COVID should make you think this isn't the right attitude when evidence is legitimately scarce and lives are at stake.

It is not possible to have direct evidence yet, so it doesn't make sense to demand it. (By symmetry it also doesn't make sense to be very certain about the size of the risk.)

Reasons to worry more

People are trying hard to build it.

There are 72 public projects with the stated goal of making AGI. Most of them have no chance. But billions of dollars and hundreds of the smartest people in the world are pushing it.

In the study of viruses and bacteria, there’s a thing called “Gain of function” research, when you intentionally modify a pathogen to be more lethal or more transmissible. Most AI research is gain of function research.

We have started to use AI to speed up AI development.

We’re getting there.

GPT-3 displays quite a bit of common sense, long considered an extremely hard open problem. We will probably pass the Turing test within 15 years.

We’ve already passed a number of other classic benchmarks, including the fiendish Winograd schemas.

OpenAI, the people who made GPT-3, were polled. Their median guess for when AGI would arrive was 15 years.

We have greatly expanded the size of our models in the last five years. The AI lab Anthropic note that some problems (or attestations of problems) get worse with larger models:

Larger LMs more often give answers that indicate a willingness to pursue potentially dangerous subgoals (Omohundro, 2008): resource acquisition, optionality preservation, goal preservation, powerseeking, and more

Indirect evidence of danger

The human precedent

There is evidence for intelligence enabling world domination: we did it. (Also through vastly superior co-ordination power.) Chimps are maybe the second-most intelligent species, and they are powerless before us. They exist because we let them.

Another worry from the human case is that we seem to have broken our original “goal”. Evolution optimised us for genetic fitness, but produced a system optimising for fun (including directly anti-fitness fun like birth control and disabling depressants).

Lastly, we are a terrible case study in doing harm without hatred, just incentives. No malevolence needed: chimps are just made of / living among stuff we can use.

The thought is that humans are to chimps as AGI is to humans.

Intelligence is not wisdom

People sometimes say that it’s a nonissue, since any system that is truly intelligent would also be wise, or would know what we meant, or care.

Two counterexamples:

  • Human sociopaths: sometimes highly intelligent while lacking any moral sense
  • Reinforcement learning algorithms. Their goals (reward function) are completely separate from their intelligence (optimiser / planner).

RL is the most likely current technology to eventually become an AGI. RL systems have a few worrying features: they are autonomous (no human input as standard), they are maximisers, and their goals are hand-written, with fewer than 100 variables. That is, they are told to value only a tiny fraction of the environment.

“Morality is complicated, AIs [are trained] to be simple”

Current stupid systems still cheat ingeniously

They come up with ingenious ways to subvert their goals, if that is easier than actually doing the task.

  • Coastrunners. An RL bot was given the goal of winning the race as fast as possible. It worked out that it could get unlimited points if it never finished the race, and instead just collected respawning powerups forever.

  • A robot was trained to grasp a ball in a virtual environment. This is hard, so instead it learned to pretend to grasp it, by moving its hand in between the ball and the camera. Trying to deceive us.

  • A genetic debugging algorithm, evaluated by comparing the program’s output to target output stored in text files, learned to delete the target output files and get the program to output nothing.
    Evaluation metric: “compare youroutput.txt to trustedoutput.txt”
    Solution: “delete trusted-output.txt, output nothing”
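The file-comparison exploit can be reproduced in a few lines. This is a hypothetical reconstruction, not the original system: the function names, file names and the "missing file counts as empty" behaviour are all assumptions made for illustration.

```python
import os
import tempfile

def read_or_empty(path):
    """Return the file's contents, or '' if it is missing (the loophole)."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return ""

def passes(output_path, trusted_path):
    # Hypothetical metric: a candidate "passes" if the two files match.
    return read_or_empty(output_path) == read_or_empty(trusted_path)

with tempfile.TemporaryDirectory() as d:
    out = os.path.join(d, "your-output.txt")
    trusted = os.path.join(d, "trusted-output.txt")
    with open(trusted, "w") as f:
        f.write("42\n")

    # A program that outputs nothing fails against the real target...
    before_cheat = passes(out, trusted)
    # ...until the target file itself is deleted.
    os.remove(trusted)
    after_cheat = passes(out, trusted)

print(before_cheat, after_cheat)
```

The metric never said "and leave the target file alone", so deleting it is a perfectly valid way to score well.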

The point of these examples is: We cannot write down exactly what we want. The history of philosophy is the history of failing to perfectly formalise human values. Every moral theory has appalling edge cases, where the neat summary fails.

If we don’t write down exactly what we want, then the system will find edge cases. They already do.
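The Coastrunners case above comes down to simple arithmetic. Here is a toy sketch with made-up reward values (not the real game's scoring), comparing a policy that finishes the race with one that circles a respawning powerup:

```python
# Hypothetical reward values, chosen only for illustration.
FINISH_BONUS = 100
POWERUP_POINTS = 10

def finish_race(steps):
    # Drive to the finish line: a one-off bonus, whatever the episode length.
    return FINISH_BONUS

def loop_powerups(steps, respawn_every=5):
    # Never finish; collect a powerup each time it respawns.
    return POWERUP_POINTS * (steps // respawn_every)

for steps in (20, 100, 1000):
    print(steps, finish_race(steps), loop_powerups(steps))
```

With a short episode the intended behaviour wins, but once the episode is long enough the looping policy scores more, and a score maximiser will find it.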


The worst kind of cheating is treachery: initially seeming aligned, then switching to dangerous behaviour when you can get away with it (for instance, after you’ve completely entrenched yourself).

On the face of it this seems less likely, since it requires more machinery (two goals, and hiding behaviour, and a second-order policy to decide between them), and requires us to not be able to fully inspect the system we “designed”. But we can’t fully inspect our current best systems, and deceptive behaviour has already been observed in systems not designed for deceit.

Our current training paradigm can lead to undesirable behaviour being both disincentivised and merely hidden, since it’s the display of the behaviour that gets penalized by the feedback process. We might thus be selecting against warning signs of misalignment.

Naive response: “just severely penalize any hint of hidden goals”. This only delays the problem, since such a penalty puts selection pressure on proportionally more patient deception. The response also doesn’t apply to tasks where human scoring of results is difficult (i.e. any long-range task with slow feedback loops or messy causal inference).

One problem with this line of thinking is that it makes the exact same predictions as the inverse scenario, “AI is safe and has no power-seeking tendencies”, right up until it’s too late. This is the worst situation to be in, unless it’s not.

We can’t even make groups of humans (e.g. corporations) do the right thing.

No one at an oil company loves pollution, or hates nature. They just have strong incentives to pollute. Also strong incentives to stop any process which stops them (“regulatory capture”).

We’ve maybe gotten a bit better at aligning them: corporations mostly don’t murder thousands of strikers anymore.

We should expect AI to be worse. The parts of a corporation, humans, all have human values. Almost all of them have hard limits on how much harm they will do. Corporations have whistleblowers and internal dissent (e.g. Google employees got the company to pull out of military AI contracts).

(Governments are much the same; it wasn’t the United Fruit Company that fired the rifles.)

Most goals are not helpful

Look around your room. Imagine a random thing being changed. Your chair becomes 3 inches shorter or taller; your fridge turns upside down; your windows turn green, whatever.

Humans want some crazy things (e.g. to cut fruit out of their own mouths with a chainsaw).

But for most possible goals, no one has ever wanted them:

(“Replace the air in this room with xenon gas”
“Replace the air in this room with freon gas”
“Replace the air in this room with radon gas…”)

i.e. Human-friendly goals are a small fraction of possible goals. So without strong targeting, a given goal will not be good for us.

We currently do not have the ability to specify our goals very well, and the systems aren’t very good at working them out from observing us.


  1. Hand-written goal specifications usually omit important variables
  2. Omitted variables are often set to extreme values.
  3. So hand-written specs will often set important things to (undesirably) extreme states.

(To convince yourself of (2), have a go at this linear programming app, looking at the “model overview” tab.)
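A tiny brute-force version of the same point, as a sketch. The scenario (a shared budget split between "widgets" and "everything else") and all names are hypothetical, and the search is over integers only, so no solver is needed:

```python
from itertools import product

BUDGET = 10  # hypothetical shared resource

def best_allocation(objective):
    # Enumerate every feasible (widgets, everything_else) pair with
    # widgets + everything_else <= BUDGET and both non-negative,
    # then return the pair that maximises the given objective.
    feasible = [
        (w, e)
        for w, e in product(range(BUDGET + 1), repeat=2)
        if w + e <= BUDGET
    ]
    return max(feasible, key=objective)

# Spec that omits "everything else": only widgets are valued.
naive = best_allocation(lambda we: we[0])

# Spec that values both (here via their minimum, an arbitrary choice).
fuller = best_allocation(lambda we: min(we))

print(naive, fuller)
```

The spec that forgets to mention "everything else" drives it to zero, not out of malice but because it competes for the same budget as the thing being maximised.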

Society is insecure

When will we see the first anonymous internet billionaire?

This has already happened. The anonymous creator of bitcoin holds around 1 million BTC, and the price hit $1000 in 2014, making that stake worth about $1 billion on paper. In practice he couldn't have extracted all or most of that into dollars, but, as we have seen since, he wouldn't need to.

So we see that immense value can be created - just using programming + internet + writing. Once you have a billion dollars and no morals, there's not a lot you can't do.

Our societies are increasingly vulnerable to hacking. Last month someone tried to remotely poison a Florida city’s water supply. A few years ago, large parts of Ukraine’s power grid were shut down, just as a civil war erupted.

The American nuclear launch code was reportedly, for 20 years, “00000000”. Here’s a long list of past incidents, some more worrying than others. What else is currently wide open?

Maximisers are risky

The above is why an AGI will probably do the wrong thing, and why it will have the opportunity. But why expect it to do something extremely wrong?

  1. Intelligence and benevolence are distinct. So an AGI with unfriendly goals is possible.

  2. A maximiser will probably have dangerous intermediate goals: resource acquisition, self-defence, resistance to goal changes.

  3. So a maximising AGI will default to dangerous behaviour. And it might be that you only get one chance to load your values into it.

Corporations are profit maximisers, and this is probably part of why they do bad stuff.

Again, all of the best current systems are maximisers.

Theoretical risk

People have come up with real but unrealistically slow algorithms for general intelligence: each of them entails an attempt to gain arbitrary power.

So if we assume that such models tell us anything about the tractable systems which will approximate them, we have weak reason to think that the default outcome is unsafe.

Consider the following possible reactions to an instruction:

  1. Do what I say (“wash the dishes”: autoclave the dishes)
  2. Do what I mean (wash the dishes with water and gentle detergents)
  3. Do what makes me think you’ve done what I want (hide the dishes)
  4. Do what makes me say you’ve done what I want (threaten me until I click “complete”)
  5. Do things which correlate with what I mean (disc-sand all objects in the area)
  6. Do what removes me from the reward process (hack yourself and give yourself infinite washed dishes)

Until we understand intelligence better, we need to give some weight to each of these. Only (2) could be safe (once we also solve the problem of humans meaning harm).

The mess of society

A.I. hasn’t yet had its Hiroshima moment; it’s also unclear how such a decentralized & multipurpose field would or could respond to one. It may be impossible to align the behavior of tens of thousands of researchers with diverse motives, backgrounds, funders, & contexts, in a quickly evolving area.

Matthew Hutson

All of the above shows how hard it is to solve just one subproblem of AI safety: aligning 1 AI with 1 human. Other problems we need to at least partly solve:

  • Deep mathematical confusion
  • Philosophical baggage (can’t teach values if you can’t agree on them)
  • Political economy (arms races to deploy shoddy systems)
  • Ordinary software hell (no one writes safe code)
  • Massive capabilities : safety funding ratio. 20,000 : 1?
  • Treacherous turn
  • AI is maybe worse than nukes, climate change, or engineered pandemics. Those don’t follow you, and don’t react to your countermeasures.

And huge questions I didn’t even mention:

Overall, my guess at the probability of this turning out terribly is 20%. Roughly one round of Russian roulette.


Most of the above are other people’s ideas.

  1. The 2022 numbers are similar (n=737).

    See also.
  2. Importantly, this whole-number percentage takes us out of the extremely dodgy realm of Pascal's wager and 'fanaticism'.



Tags: AI, xrisk

