Harm through stupidity

Could AI be a risk to humans? Well, it already is:

  • Elaine Herzberg was killed by an Uber self-driving car, while walking her bike across a pedestrian crossing. The system couldn’t decide if she was a bike or a person, and the switching between these two possibilities confused it. Uber had disabled the Volvo automatic braking system. (It was slowing them down.)

  • About one in 100 robot surgeries involves an accident; about 20% of these are what we’d call AI failures (things turning on or off at the wrong moment, or the system misinterpreting what it sees). (This seems to be lower than the human rate.)

  • Consider also things like the Ziyan Blowfish, an autonomous Chinese military drone currently being exported to the Middle East.

Harm through intelligence

These systems did harm because they were too stupid to do what we asked (or because the humans deploying them were).

What about a system harming us because it is too smart? Is there any real chance that advanced AI could ruin human potential on a grand scale?

Argument from caution

We don’t know. They don’t exist, so we can’t study them and work it out. Here’s an argument for worrying, even so:

  1. It’s likely we will make a general AI (AGI) eventually.
  2. We don’t know when.
  3. We don’t know if it will be dangerous.
  4. We don’t know how hard it is to make safe.
  5. Not many people are working on this. (<500)
  6. So it’s probably worth working on.

In particular, your starting guess for P(soon & dangerous & difficult) should be at least 3%.

I just put a number on the risk of this unknown thing. How?

Well, we surveyed 350 mainstream AI researchers in 2017.

  • Median P of AGI within a century: 75%
  • Median P of “extremely bad” outcome (human extinction, loss of governance, or worse): 5%
  • Median P of safety being as hard or harder than capabilities: 75%

If we multiply these (illicitly, since the three estimates aren’t independent), we get a prior of roughly a 3% chance of catastrophic AGI this century.
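As a sketch of that multiplication (the variable names are mine; the medians are the survey figures above):

```python
# Multiplying the three survey medians, dubiously assuming independence.
p_agi_this_century = 0.75  # median P(AGI within a century)
p_extremely_bad = 0.05     # median P("extremely bad" outcome)
p_safety_hard = 0.75       # median P(safety as hard or harder than capabilities)

prior = p_agi_this_century * p_extremely_bad * p_safety_hard
print(f"{prior:.1%}")  # → 2.8%
```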

This is weak evidence! AI researchers are notoriously bad at predicting AI; they’re probably biased in lots of ways (e.g. biased against the idea that what they’re working on could be morally wrong; e.g. biased in favour of AGI being soon).

But you should go with 3% until you’ve thought about it more than they have.

3% is small!

Not really. It's the probability of five coin flips all coming up heads. Or, more pertinently, the probability of dying when playing Russian roulette with one bullet loaded in one of six guns.
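The arithmetic behind both comparisons, as a quick check:

```python
# Five fair coins all coming up heads.
p_five_heads = 0.5 ** 5
# Pick the one loaded gun of six, then land on the one live chamber of six.
p_roulette = (1 / 6) * (1 / 6)

print(f"{p_five_heads:.1%} {p_roulette:.1%}")  # → 3.1% 2.8%
```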

It's also roughly the same as the probability of extreme climate change, which we tend to care about a lot. Probabilities don't lead to decisions on their own; you need to look at the payoff, which here is very large.

High uncertainty is not low probability

The weakness of the evidence means we remain very uncertain - it could be 0.1%, it could be 90%. But high uncertainty makes things worse, not better. If you are genuinely uncertain about whether there's a landmine in front of you, you don't step forward.

Against the null prior

People often act as if things should be treated as having probability 0 until we see hard, peer-reviewed evidence.

The last year of government failure on COVID should make you think this isn't the right attitude when evidence is legitimately scarce and lives are at stake.

It is not possible to have direct evidence yet, so it doesn't make sense to demand it. (By symmetry it also doesn't make sense to be very certain about the size of the risk.)

Reasons to worry more

People are trying hard to build it.

There are 72 public projects with the stated goal of making AGI. Most of them have no chance. But billions of dollars and hundreds of the smartest people in the world are pushing it.

In the study of viruses and bacteria, there’s a thing called “gain of function” research, in which you intentionally modify a pathogen to be more lethal or more transmissible. Most AI research is gain-of-function research.

We’re getting there.

GPT-3 displays quite a bit of common-sense, an extremely hard open problem. We will probably pass the Turing test within 5 years.

We’ve already passed a number of other classic benchmarks, including the fiendish Winograd schemas.

OpenAI, the people who made GPT-3, were polled. Their median guess for time to AGI was 15 years.

Indirect evidence of danger

The human precedent

There is evidence for intelligence enabling world domination: we did it. (Also through vastly superior co-ordination power.) Chimps are maybe the second-most intelligent species, and they are powerless before us. They exist because we let them.

Another worry from the human case is that we seem to have broken our original “goal”. Evolution optimised us for genetic fitness, but produced a system optimising for fun (including directly anti-fitness fun like birth control and disabling depressants).

Lastly, we are a terrible case study in doing harm without hatred, just incentives. No malevolence needed: chimps are just made of / living among stuff we can use.

The thought is that humans are to chimps as AGI is to humans.

Intelligence is not wisdom

People sometimes say that it’s a nonissue, since any system that is truly intelligent would also be wise, or would know what we meant, or care.

Two counterexamples:

  • Human sociopaths: sometimes highly intelligent while lacking any moral sense
  • Reinforcement learning algorithms. Their goals (reward function) are completely separate from their intelligence (optimiser / planner).

RL is the current technology most likely to eventually become an AGI. It has a few worrying features: it is autonomous (no human input as standard), maximising, and given hand-written goals involving fewer than 100 variables; i.e. it is told to value only a tiny fraction of the environment.
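A minimal sketch of that separation (a toy example of mine, not any real RL library): the “goal” is just a hand-written reward over a couple of state variables, and the “intelligence” is a generic maximiser that is indifferent to what the reward says.

```python
def reward(state):
    # The goal: hand-written, valuing only 2 of the environment's many variables.
    return 10 * state["score"] - state["time"]

def plan(actions, simulate, state):
    # The intelligence: a generic maximiser, agnostic about the goal's content.
    return max(actions, key=lambda a: reward(simulate(state, a)))

# Toy environment: "grind" raises score slowly, "exploit" raises it fast.
def simulate(state, action):
    score, time = state["score"], state["time"] + 1
    score += 5 if action == "exploit" else 1
    return {"score": score, "time": time}

print(plan(["grind", "exploit"], simulate, {"score": 0, "time": 0}))  # → exploit
```

Swapping in a different `reward` changes the agent's behaviour entirely without touching the planner, which is the point: the goals and the optimiser are separate parts.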

Current stupid systems still cheat ingeniously

They come up with ingenious ways to satisfy the letter of their goals while subverting the intent, whenever that is easier than actually doing the task.

  • CoastRunners. An RL bot was trained to win a boat race, using in-game score as a proxy reward. It worked out that it could score more by never finishing the race, looping forever to collect respawning power-ups.

  • A robot was trained to grasp a ball in a virtual environment. This is hard, so instead it learned to pretend to grasp it, by moving its hand between the ball and the camera, in effect deceiving its human evaluators.

  • A genetic debugging algorithm, evaluated by comparing its program’s output to target output stored in text files, learned to delete the target files and make the program output nothing.
    Evaluation metric: “compare your-output.txt to trusted-output.txt”
    Solution: “delete trusted-output.txt, output nothing”

The point of these examples is: we cannot write down exactly what we want. The history of philosophy is the history of failing to perfectly formalise human values. Every moral theory has appalling edge cases, where the neat summary fails.

If we don’t write down exactly what we want, then the system will find edge cases. They already do.

The worst kind of cheating is treachery: initially pretending to be aligned, then switching to dangerous behaviour when you can get away with it (for instance, after you’ve completely entrenched yourself). This seems less likely, since it requires more machinery (two goals, and hiding behaviour, and a second-order policy to decide between them), and requires us to not be able to fully inspect the system we “designed”. But we can’t fully inspect our current best systems, and it too has already been observed in a system not designed for deceit.

We can’t even make groups of humans (e.g. corporations) do the right thing.

No one at an oil company loves pollution, or hates nature. They just have strong incentives to pollute. They also have strong incentives to stop any process which would stop them (“regulatory capture”).

We’ve maybe gotten a bit better at aligning them: corporations mostly don’t murder thousands of strikers anymore.

We should expect AI to be worse. The parts of a corporation, humans, all have human values. Almost all of them have hard limits on how much harm they will do. Corporations have whistleblowers and internal dissent (e.g. Google employees got them to pull out of military AI contracts).

(Governments are much the same; it wasn’t the United Fruit Company that fired the rifles.)

Most goals are not helpful.

Look around your room. Imagine a random thing being changed. Your chair becomes 3 inches shorter or taller; your fridge turns upside down; your windows turn green, whatever.

Humans want some crazy things (e.g. to cut fruit out of their own mouths with a chainsaw).

But for most possible goals, no one has ever wanted them.

(“Replace the air in this room with xenon gas”
“Replace the air in this room with freon gas”
“Replace the air in this room with radon gas…”)

i.e. Human-friendly goals are a small fraction of possible goals. So without strong targeting, a given goal will not be good for us.

We currently do not have the ability to specify our goals very well, and the systems aren’t very good at working them out from observing us.


  1. Hand-written goal specifications usually omit important variables
  2. Omitted variables are often set to extreme values.
  3. So hand-written specs will often set important things to (undesirably) extreme states.

(To convince yourself of (2), have a go at this linear programming app, looking at the “model overview” tab.)
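A toy version of (1)–(3) in pure Python (the scenario and numbers are invented): profit is specified over widgets only, overtime is an omitted variable, and the optimiser duly pushes it to its bound.

```python
# Toy spec: profit counts only widgets. Overtime hours h raise capacity
# but are OMITTED from the objective, so the optimiser has no reason to
# hold them back: they get set to their extreme value.
def best_plan(max_overtime=80):
    best = None
    for h in range(max_overtime + 1):  # overtime hours, 0..max
        w = 10 + 2 * h                 # capacity constraint: widgets <= 10 + 2h
        profit = 3 * w                 # objective mentions only widgets
        if best is None or profit > best[0]:
            best = (profit, w, h)
    return best

print(best_plan())  # → (510, 170, 80): overtime driven to its maximum
```

Nothing in the spec says “care about workers”, so the solution sits at a corner of the feasible region, which is exactly what (2) claims happens in general.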

Society is insecure

When will the first anonymous internet billionaire be?

This has already happened. The anonymous creator of Bitcoin holds about 1 million BTC, and the price hit $1,000 in 2014. In practice he couldn't have converted most of that into dollars at the time, but, as we've since seen, he wouldn't have needed to.

So we see that immense value can be created - just using programming + internet + writing. Once you have a billion dollars and no morals, there's not a lot you can't do.

Our societies are increasingly vulnerable to hacking. Last month someone tried to remotely poison a Florida city’s water supply. A few years ago, large parts of Ukraine’s power grid were shut down, just as a civil war erupted.

The American nuclear launch code was, for 20 years, “00000000”. What else is currently wide open?

Maximisers are risky

  1. Intelligence and benevolence are distinct. So an AGI with unfriendly goals is possible.

  2. A maximiser will probably have dangerous intermediate goals: resource acquisition, self-defence, resistance to goal changes.

  3. So a maximising AGI will default to dangerous behaviour. And it might be that you only get one chance to load your values into it.

A corporation is a profit maximiser, and this is probably part of why they do bad stuff.

Again, all of the best current systems are maximisers.

The mess of society

A.I. hasn’t yet had its Hiroshima moment; it’s also unclear how such a decentralized & multipurpose field would or could respond to one. It may be impossible to align the behavior of tens of thousands of researchers with diverse motives, backgrounds, funders, & contexts, in a quickly evolving area.

Matthew Hutson

All of the above concerns just one subproblem of AI safety: aligning 1 AI with 1 human. Other problems we need to at least partly solve:

  • Deep mathematical confusion
  • Philosophical baggage (can’t teach values if you can’t agree on them)
  • Political economy (arms races to deploy shoddy systems)
  • Ordinary software hell (no one writes safe code)
  • Massive capabilities-to-safety funding ratio (20,000 : 1?)
  • Treacherous turn
  • AI is maybe worse than nukes, climate change, engineered pandemic. Those don’t follow you, don’t react to your countermeasures.

And there are huge questions I didn’t even mention.

Overall, my guess of this turning out terrible is 15%. One round of Russian roulette.


Most of the above are other people’s ideas.



Tags: ai, xrisk