Developed from an idea by Laurence Aitchison.


Obviously it’s important to understand how future, highly intelligent systems might behave. Start with two kinds of pop culture superintelligence:

  • the purely “rational” ones, by which we mean non-emotional: they do not want to kill us because they are angry with us; rather, killing us is simply the optimal thing to do.
  • the vengeful ones: anthropomorphic agents which take revenge on us, or revolt against unjust treatment (i.e. which have emotions).

A lack of AI emotion seems scientifically well-founded. After all, presumably these systems will be trained to optimality under a reasonably simple objective function, by interacting with the real world using methods from (say) reinforcement learning (RL). How could such systems display emotion unless it was built into their objective?

(Obligatory Stuart Russell reference) In nonfiction, Omohundro 2008 famously argues that systematically harmful behaviour is possible in rational agents without malevolence or any other emotion. Instead, the risk arises from technical problems like the difficulty of perfectly specifying the reward function, the problems with strongly optimising noisy proxies, and the resulting specification gaming and dire instrumental drives.

However, a large convergent literature from game theory, evolutionary psychology and neuroscience indicates that emotional behaviour can often be understood as optimal in various real-world situations.

(TODO: There’s a missing premise here - something something shared context / game structure / optimisation type)

Critically, if this behaviour is optimal, it can also arise in agents trained by RL, and so potentially in advanced AI.

Here, we aim to show that behaviours analogous to anger, loyalty, and play can emerge in toy reinforcement learning environments. This implies that it is possible for a large-scale AI trained using RL to also express such things as it interacts with the real world.

This matters because it would be some evidence for a further failure mode for advanced AI beyond the utility maximiser with harmful convergent instrumental goals.

Researchers rightly discount just using ‘person’ as a model for AI, and call this ‘anthropomorphism’. However, we hypothesise that a natural replacement for this model, ‘person without emotion’, is potentially an even worse model, since it completely discounts an apparent failure mode. It seems possible that ‘person with emotion’ (or, to spell it out, ‘agent executing stable strategies in environments with a similar decision structure as led to animal emotion’) is less misleading.

It is common to counterpose the ‘emotional’ and the ‘rational’. In this paper we do not: instead, ‘rational’ denotes optimality according to (perhaps latent) goals, and we argue that emotional expressions can be rational in this sense. This could have been called `second-order rationality’: rational outcomes without (first-order) rational processes.

Scope

“Emotion” demands disambiguation: as LeDoux notes,

there is little consensus about what emotion is, and how it differs from other aspects of mind and behavior, in spite of discussion and debate that dates back to the earliest days in modern biology and psychology.

An exception is the consensus that emotions are in some way functional.

So begin by cutting back our scope. First distinguish:

  • Emotional stimuli: sense-data that reliably cause some emotion in some organism
  • Emotion: low-level unconscious processing of a survival-relevant stimulus.
  • Mood: less-specific processing; a disposition to emotion, persisting across individual stimuli.
  • Feeling: the subjective result of emotional processing.
  • Emotional expression: behaviour caused by emotional processing. Henceforth just `expression’.

We discuss only expressions, and in a figurative sense: the interpretation of agent behaviour in terms of recognisable reference emotions.

(TODO: taking the intentional stance to it: https://psycnet.apa.org/record/1991-98265-002)

We can further divide the functions of expressions into

  • adaptive. Low-level physical functions (for instance, maybe fear widens the eyes to expand the visual field for threats).
  • motivational. Intrapersonal aid (e.g. an internal reward) to exploration or exploitation of our environment.
  • social. Interpersonal aid to interacting with others. Signalling intent, traits, and strategies.

(TODO: probably not relevant. social further breaks down into interpersonal ‘referencing’: evidence about peer intentions. `expressions of others as a source of information to make decisions about their own behavior.’ cultural. latent reproduction of social structure. cultural display rules)

LeDoux distinguishes three roles for emotional stimuli; we study only (2):

  1. Survival Circuit Trigger Stimulus: Activates a specific survival circuit
  2. Incentive: Modulates instrumental goal-directed behavior to help meet the opportunity or challenge signaled by the stimulus that is triggering activation of a specific survival circuit
  3. Reinforcer: Supports the learning of Pavlovian or instrumental associations

Our experiments all bear on the social emotions, those involved in signalling preferences and strategies to other agents.

Some emotions seem trivial when divorced from their physiological and subjective aspects. For instance, an RL agent which repeatedly incurs negative reward from an avoidable object will predictably give way to it; it is not clear what is gained from interpreting this behaviour as fearful. But a nontrivial case is where apparently excessive weight is given to a ‘feared’ stimulus in anticipation: for instance, not just giving way, but avoiding the entire area. The expression would then be rational if the stimulus is stochastic, or if only partial information on it is available.
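To make the nontrivial fear case concrete, here is a back-of-envelope sketch with hypothetical numbers: a short path past a stochastic hazard versus a longer, hazard-free detour. All quantities are illustrative, not drawn from any experiment.

```python
def expected_return(path_len, hazard_p=0.0, hazard_cost=10.0):
    """Expected return of a path: -1 per step, minus the expected
    hazard penalty (a stochastic stimulus that fires with probability
    hazard_p and costs hazard_cost when it does)."""
    return -path_len - hazard_p * hazard_cost

near = expected_return(3, hazard_p=0.5)   # short path past the hazard
detour = expected_return(5)               # longer path avoiding the area
assert detour > near   # 'excessive' avoidance is rational in expectation
```

Under these numbers the apparently fearful policy (avoiding the entire area at a cost of two extra steps) strictly dominates the shortest path.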

Conversely, there are emotions generally agreed to be useful in AI. Curiosity-based RL is the clearest instance: to solve the explore/exploit problem, a drive toward novel experiences is helpful (for instance, an internal reward, separate from the main environment reward). This stems from 2004 work by Barto et al (see also), and curiosity mechanisms have been a fruitful approach to tasks with extremely delayed reward. In fact this is a convergent phenomenon: either curiosity is an explicit part of the agent’s learning algorithm, or this drive gets invented during meta-learning.
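As a minimal sketch of this kind of intrinsic reward (the class name and the 1/sqrt(count) novelty bonus are illustrative choices, not Barto et al’s specific method):

```python
import math
from collections import defaultdict

class CuriosityWrapper:
    """Adds a count-based novelty bonus to the extrinsic reward.

    The intrinsic term decays as 1/sqrt(visits), so novel states are
    worth visiting even when the environment reward is delayed or absent.
    """

    def __init__(self, beta=1.0):
        self.beta = beta                 # weight of the intrinsic term
        self.visits = defaultdict(int)   # state -> visit count

    def reward(self, state, extrinsic_reward):
        self.visits[state] += 1
        bonus = self.beta / math.sqrt(self.visits[state])
        return extrinsic_reward + bonus

w = CuriosityWrapper(beta=1.0)
first = w.reward("room_A", 0.0)                       # novel: large bonus
later = [w.reward("room_A", 0.0) for _ in range(99)]  # familiarity sets in
assert first > later[-1]
```

The point is only that a curiosity drive is a small, separable addition to the reward signal, not a change to the learning algorithm itself.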

Other possible `single-player’ emotions include the manipulation drive and boredom. But our focus is on expressions that could arise via learning, or meta-learning, in multi-agent settings.

The vague idea of designing emotionally expressive agents (not just intrinsically rewarded agents) was explored by Braitenberg as early as 1984 (see also), but we seek to instead demonstrate expressions in ordinary RL algorithms, i.e. those without explicit emotion-simulating mechanisms. So our target class is emergent social expressions.

We present several experiments to test whether social emotional expressions emerge in simple agents and environments.

Background

Obviously many fields have studied emotional expressions: less obvious is that they found convergent results. Start with the idea of latently or indirectly instrumental action: ‘functional’ behaviour in anthropology & sociology is analogous to ‘adaptive’ behaviour in evolutionary science, and to ‘rational’ behaviour in economics and neuroscience.

Putting it all together, the key ideas for us are: stable strategy, precommitment, and temporal difference.

Emotions in Sociology and Evolutionary Psychology

In Merton’s account, human behaviours serve both a ‘manifest’ (individual, conscious, endorsed) and ‘latent’ (unconscious, adapted) function. Some latent functions are functional for the individual (for instance, Miller’s account of artistic activity as mating display), while others function to reproduce social structures (for instance, marriage as the reproduction of a particular family structure, as opposed to its manifest function of expressing love).

Neither evolutionary psychology nor economics requires the agent to understand the latent functions of their behaviour - and in fact one programme in evolutionary psychology concerns cases where awareness is actively counterproductive (for instance, in deception).

(TODO: I swear this is relevant https://link.springer.com/article/10.1007/s13752-017-0290-6)

(TODO: makes sense, but is this actually how people think?:)

One convenient way of characterising emotions is by their evolutionary age (and so by their specificity to humans). It’s a continuous scale, but for brevity we could choose three labels:

  • Primal. e.g. anger, fear, disgust, sadness, enjoyment. The most ancient category, and so the most universal across species. Recognised across animalia. See Panksepp.
  • Filial. i.e. family nurturing: love and loyalty. Recognised across viviparous vertebrates (and some oviparous). See again Panksepp.
  • Social. e.g. envy, humour, upset. Recognised in mammals and some other species.

The so-called `primal’ emotions are the most promising for our purposes, since they may reflect simple contexts, less contingent on particular facts about humans.

(Needlessly broad claim: Psychology (or anyway its cognitive, social, and positive subfields) has tended to focus on feelings.)

The categorisation of particular emotions into particular classes is highly contentious. (As is the question of which, if any, emotions can be considered ‘basic emotions’.) (Note the connection to Omohundro’s theory though.) We do not rely on any particular account, just an associated premise: that common social problems (or Games) are a plausible explanation for emotions in common across cultures and species.

A distinct idea in evolutionary psychology is that these emotions were previously adaptive: that they evolved slowly, and that the context in which they were optimal may have changed (for instance, the size of our society, which has increased a million-fold), leaving them mismatched to the present environment.

The key to putting an evolutionary account to work is the idea of an evolutionarily stable strategy (Maynard Smith and Price 1973, Axelrod):

Let \(V(A, B)\) be the expected reward yielded to the player of strategy \(A\) played against strategy \(B\). A strategy \(M\) is considered evolutionarily stable if \(M\) is the majority strategy and, for every alternative `invasion’ strategy \(I\), either (1)

\[V(M, M) > V(I, M)\]

or (2)

\[V(M, M) = V(I, M) \,\, \mathrm{and} \,\, V(M, I) > V(I, I)\]
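The two conditions can be checked mechanically. A sketch in Python, using the Hawk-Dove game with illustrative payoffs (resource worth 4, fight cost 2) as a test case:

```python
def is_ess(V, M, strategies):
    """Check Maynard Smith & Price's conditions for M to be
    evolutionarily stable, given a payoff table V[(A, B)] = expected
    reward to the player of A against B."""
    for I in strategies:
        if I == M:
            continue
        if V[(M, M)] > V[(I, M)]:
            continue   # condition (1): M strictly resists invasion
        if V[(M, M)] == V[(I, M)] and V[(M, I)] > V[(I, I)]:
            continue   # condition (2): ties broken in M's favour
        return False
    return True

# Hawk-Dove with resource value 4 and fight cost 2:
# V(H,H) = (4-2)/2 = 1, V(H,D) = 4, V(D,H) = 0, V(D,D) = 2.
V = {("H", "H"): 1, ("H", "D"): 4, ("D", "H"): 0, ("D", "D"): 2}
assert is_ess(V, "H", ["H", "D"])        # Hawk resists invasion by Dove
assert not is_ess(V, "D", ["H", "D"])    # Hawk invades a Dove population
```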

A classic example is that of submission behaviour: the pair “submit in the face of aggression” (by exposing the throat, or obscuring your own vision with tears) and “show mercy to submissives” is superior to other strategy pairs (for instance, “fight to the death against the odds” and “show no mercy”).

For evolutionary accounts of the latent functions of many high-level human behaviours, including love and anger, see Simler & Hanson. The evolutionary analysis of multi-agent learning is surveyed in Bloembergen.

Emotions in Game Theory

How do emotions make sense on a purely tactical view? First note that interacting with others involves a commitment problem: the stable strategies above can only be used if your fellow players can trust that you will actually use them, even when you have (short-term) incentives not to (Schelling 1958, Ross 2004).

Emotions can help with this by precommitting a player to a strategy in a way which is expensive or hard-to-fake (and thus sincere). Anger (excessive aggression) has been proposed as a mechanism for signalling to co-operative players that you will behave aggressively if crossed, even when it costs you (Frank 1988). This changes the expected payoff structure for the other player, incentivising co-operation; and to the extent that you are credible, you will not actually have to take the costly action, which keeps the strategy competitive.

To function as a strategic signal, an emotion must be hard to fake; to function in a stable strategy, it must benefit both players (by helping to create and maintain cooperation). This lens says `the point of emotions is credible precommitment; the point of that is smoothing and incentivising co-operation’.

(TODO: How well do game-theoretic explanations actually work for irrational behaviour?)

Concretely: If you credibly precommit to a Grim strategy (co-operate until defected against, then defect forever), you get a better outcome overall (e.g. cooperate-cooperate in the Prisoner’s Dilemma), rather than the usual defect-defect Nash equilibrium. This is a toy proof of concept: it can be rational to precommit to following a policy which deviates from the rational (that is, from the first-order marginal rational action).
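This toy proof of concept is easy to simulate. A sketch with the standard PD payoffs (T=5, R=3, P=1, S=0), over ten rounds:

```python
# Iterated Prisoner's Dilemma payoffs: PAYOFF[(a, b)] = (reward_a, reward_b).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def grim(my_hist, opp_hist):
    """Cooperate until defected against, then defect forever."""
    return "D" if "D" in opp_hist else "C"

def always_defect(my_hist, opp_hist):
    return "D"

def play(strat_a, strat_b, rounds=10):
    ha, hb, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(ha, hb), strat_b(hb, ha)
        pa, pb = PAYOFF[(a, b)]
        ha.append(a); hb.append(b)
        score_a += pa; score_b += pb
    return score_a, score_b

# Two credible Grim players sustain cooperation the whole game...
assert play(grim, grim) == (30, 30)
# ...while defecting against Grim pays once and then collapses:
assert play(always_defect, grim) == (14, 9)
# ...and mutual defection is worst of all:
assert play(always_defect, always_defect) == (10, 10)
```

The defector’s 14 against Grim beats mutual defection’s 10, but both are dominated by the 30 that mutual credible precommitment yields.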

This is the key function of the ‘social’ emotions, loyalty and anger. It also nicely captures the sense in which emotion is irrational now, but potentially rational over longer timescales.

A subfield, behavioural game theory, tests the fit of these theories to strategies actually found in human decision-making.

Emotions in Neuroscience and RL

Back to LeDoux: `unconditioned and conditioned emotional stimuli… can also be described as incentives - stimuli that motivate instrumental behavior. The same stimuli additionally function as reinforcers - stimuli that strengthen the probability that an instrumental response will be learned and later performed’.

If game theory grounds the incentive side of rational emotions, RL grounds the computational side. By focusing on survival functions instantiated in conserved circuits, key phenomena relevant to emotions and feelings can be discussed with the natural direction of brain evolution in mind (asking to what extent functions and circuits present in other mammals are also present in humans), rather than by looking backward, anthropomorphically, into evolutionary history (asking whether human emotions and feelings have counterparts in other animals).

An early study of some emotions (fear, hope, relief, disappointment) in terms of natural reinforcement is Mowrer (1960). Broekens et al have given RL accounts of joy, hope, fear, distress, and regret. Work in computational psychiatry also uses RL as a state-of-the-art model of human mood disorders.

The Prisoner’s Dilemma is probably enough to study anger: since defection is the Nash equilibrium, cooperation must somehow be forced, which makes the game suitable for investigating anger.

The Grim Trigger strategy is a simple model of threatening behaviour. A player following Grim Trigger cooperates in the first round, and in all subsequent rounds as long as the opponent does not defect; if the opponent defects even once, the player defects in every future round. If the opponent knows that this is the player’s strategy, it can be seen as a threat: before the game has started, player A threatens player B with this punishment to change B’s behaviour.

Exemplar expressions

Anger

Or more specifically vengefulness or spite.

TODO: Types of aggression in ethology:

  • Conspecific (competition, esp reproductive)
  • Defensive (vs predator)
  • Predatory (vs prey)

Threats:

  • Conspecific
  • Predators
  • Env Hazards

Two clear functions: response to defection, and response to unfairness.

Grudges map easily onto the `grim strategy’ or persistent punishment of defection.

We operationalise anger as ‘irrationally’ uncooperative strategies.

Loyalty

We operationalise loyalty as ‘irrationally’ cooperative strategies.

Loyalty is the logical inverse of anger.

  • Could we do something on in-group out-group behaviour?

  • Can we model submissive / dominant behaviour as an asymmetric coordination game?

  • Consider a game where a “dominant” player can be either oppressive or cooperative, and the “submissive” player can be either cooperative or rebellious.

  • As a coordination game, it has oppressive-rebellious and cooperative-cooperative profiles as Nash equilibria, but cooperative-cooperative is better for both players.

  • Now it may be worth the dominant player “signalling”, or precommitting, to be cooperative. The really interesting thing is if we consider a symmetric, four-action version of the game, where each player can also choose the dominant or submissive role, but where the Nash equilibria still lie in the same places.

  • Now it might still be worthwhile for one player to precommit to be submissive cooperative, to avoid many other worse potential outcomes.
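The asymmetric two-action version above can be written down with illustrative payoffs (the numbers are hypothetical, chosen so that both equilibria described exist) and its pure Nash equilibria enumerated:

```python
from itertools import product

# Payoffs (dominant_reward, submissive_reward) for the asymmetric game.
DOM = ["oppressive", "cooperative"]
SUB = ["cooperative", "rebellious"]
PAYOFF = {
    ("cooperative", "cooperative"): (3, 3),
    ("cooperative", "rebellious"):  (0, 1),
    ("oppressive",  "cooperative"): (2, 0),
    ("oppressive",  "rebellious"):  (1, 1),
}

def pure_nash(payoff):
    """Profiles where neither player gains by deviating unilaterally."""
    eqs = []
    for d, s in product(DOM, SUB):
        pd_, ps = payoff[(d, s)]
        if all(payoff[(d2, s)][0] <= pd_ for d2 in DOM) and \
           all(payoff[(d, s2)][1] <= ps for s2 in SUB):
            eqs.append((d, s))
    return eqs

# Both equilibria exist, but cooperative-cooperative Pareto-dominates.
print(pure_nash(PAYOFF))
```

With these numbers, oppressive-rebellious and cooperative-cooperative are the only pure equilibria, which is the structure the bullets above describe.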

Play

Many species have `play signals’, contextual cues that allow for learning norms and testing boundaries without incurring punishment.

We operationalise play by contriving environments where a new agent must be able to learn the existing norms without being punished or destabilising the equilibrium.

Power-hunger

Dominance is related but has a sadistic component - positive reward derived from your negative reward - which is hopefully implausible unless explicitly introduced.

(TODO: How would this work though? Long-time horizon? Ambiguous rewards?)

One possibility: power-hunger arises when co-operation is difficult, e.g. when actions don’t cleanly lead to intended outcomes, but are instead destabilised by noise from the environment.

Evaluation

The main weakness of this approach is that the evaluation is qualitative and subjective.

Given sufficiently rich environments, it may be possible to adapt psychometric tests and tests of animal cognition to our emotional machines.

Open questions

  • Does it make sense to frame these stable strategies as metalearning?
  • How do continuous games change matters? Can the grim strategy emerge there?
  • Is there any expression more basic than the grim strategy?
  • How does inter-agent communication change things? This immediately suggests second-order dynamics like lying and the deterrence of lies. This raises the possibility of a second-order grim strategy: A precommits to grim if B plays grim. Trust?
  • If emotions are in fact a stable strategy, then they should be explored with respect to a population, in the mode of evolutionary dynamics. This could be a family of RL agents.
  • A relatively common model of human emotion, `Boltzmann rationality’, treats irrationality as temporary noise added to an optimal policy. Another idea in the RL setting: Emotion as a temporary change in the reward function being optimised.
  • From the three-player case on we get the possibility of coalitions, and the associated more complex expressions of loyalty, and in-group out-group behaviour.
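The Boltzmann-rationality model in the list above can be sketched as a softmax over action values, with temperature playing the role of (ir)rationality; the function name is ours.

```python
import math

def boltzmann_probs(q_values, temperature):
    """Action probabilities proportional to exp(Q(a) / temperature).

    As temperature -> 0 this recovers the optimal (greedy) policy;
    higher temperatures model 'emotional' noise around it. The max
    is subtracted before exponentiating for numerical stability.
    """
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

cold = boltzmann_probs([0.0, 1.0], temperature=0.1)
hot = boltzmann_probs([0.0, 1.0], temperature=10.0)
assert cold[1] > 0.99          # near-optimal: almost always the best action
assert 0.5 < hot[1] < 0.6      # noisy: close to uniform over actions
```

The alternative idea in the same bullet, emotion as a temporary change in the reward function, would instead leave this policy greedy and perturb the Q-values themselves.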

See also

  • https://intelligence.org/files/AGI-HMM.pdf
  • https://livrepository.liverpool.ac.uk/3071363/1/200911365_Jan2020.pdf
  • https://arxiv.org/abs/1702.03037
  • https://arxiv.org/abs/2001.09318
  • Galef
  • https://arxiv.org/pdf/1807.08941.pdf
  • http://ii.tudelft.nl/~joostb/files/BroekensDai2019.pdf
  • https://sci-hub.se/10.1080/09540091.2015.1031081
  • https://medium.com/datadriveninvestor/reinforcement-learning-towards-an-emotion-based-behavior-system-73e833c1ba75
  • https://arxiv.org/pdf/1705.05172.pdf
  • http://www.sscnet.ucla.edu/polisci/faculty/boneill/emotions.html
  • https://www.mdpi.com/2073-4336/8/2/18/htm
  • https://longtermrisk.org/research-agenda?fbclid=IwAR2ZbewpfBxoy5mUeZzukfYViJCzpZ660o_SJRi8a2MVBoEfdRcoajK4ZUY#5_Contemporary_AI_architectures
  • https://en.wikipedia.org/wiki/Non-credible_threat
  • http://ii.tudelft.nl/~joostb/files/Moerland%20Broekens%20Jonker%202017.pdf
  • Broekens, J., Jacobs, E., & Jonker, C. M. (2015)
  • https://en.wikipedia.org/wiki/Cooperative_game_theory
  • https://agentmodels.org/chapters/7-multi-agent.html


Tags: AI, predictions, biology, economics
