Reproductive skew, fighting costs and winner–loser effects in social dominance evolution

Abstract Social hierarchies are often found in group‐living animals and can be formed through pairwise aggressive interactions. The dominance rank can influence reproductive success (RS) with a skew towards high‐ranking individuals. Using game theory, we investigate how the opportunity for differently ranked individuals to achieve RS influences the costs of hierarchy formation and the strength of winner and loser effects. In our model, individuals adjust their aggressive and submissive behaviour towards others through reinforcement learning. The learning is based on rewards and penalties, which depend on relative fighting ability. From individual‐based simulations, we determine evolutionary equilibria of traits such as learning rates. We examine situations that differ in the extent of monopolisation of contested RS by dominants and in the proportion of total RS that is contested. The model implements two kinds of fighting costs: a decrease in effective fighting ability from damage (loss of condition) and a risk of mortality that increases with the total accumulated damage. Either of these costs can limit the amount of fighting. We find that individuals form stable dominance hierarchies, with a positive correlation between dominance position and fighting ability. The accumulated costs differ between dominance positions, with the highest costs paid by low or intermediately ranked individuals. Costs tend to be higher in high‐skew situations. We identify a ‘stay‐in, opt‐out’ syndrome, comprising a range from weaker (stay‐in) to stronger (opt‐out) winner–loser effects. We interpret the opt‐out phenotype to be favoured by selection on lower ranked individuals to opt out of contests over social dominance, because it is more pronounced when more of the total RS is uncontested. We discuss our results in relation to field and experimental observations and argue that there is a need for empirical investigation of the behaviour and reproductive success of lower ranked individuals.


Observations and actions
The model simplifies a round of interaction into two stages. In the first stage, the interacting individuals each make an observation: an individual observes some aspect ξ of relative fighting ability and also observes the opponent's identity. The observation is statistically related to the difference in fighting ability between the individual and its opponent, q_i − q_j. For the interaction between individuals i and j at time t, the observation is

\[ \xi_{ijt} = a_0 (q_i - q_j) + \epsilon_{ijt}, \tag{S1} \]

where a_0 > 0 and ε_ijt is an error of observation, assumed to be normal with mean zero and SD σ. Note here that the observations ξ_ijt refer to the original fighting abilities q_i and q_j, and not the effective fighting abilities (see below). By adjusting the parameters σ_q, which is the SD of the distribution of q_i, and a_0 and σ from equation (S1), one can make the information about relative quality more or less accurate. The observation (ξ_ijt, j) is followed by a second stage, where individual i chooses an action, and similarly for individual j. The model allows only two actions, A and S, corresponding to aggressive and submissive behaviour.
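As an illustration, here is a minimal sketch of the observation stage in Python; the parameter values (a0, sigma, sigma_q) are placeholders, not the values used in the simulations.

```python
import numpy as np

rng = np.random.default_rng(1)

def observe(q_i, q_j, a0=1.0, sigma=1.0):
    """Noisy observation of relative fighting ability, as in equation (S1)."""
    return a0 * (q_i - q_j) + rng.normal(0.0, sigma)

# Example: draw fighting abilities for a pair and form both observations.
sigma_q = 0.5  # SD of the distribution of q_i (placeholder value)
q_i, q_j = rng.normal(0.0, sigma_q, size=2)
xi_ij = observe(q_i, q_j)  # i's observation of itself relative to j
xi_ji = observe(q_j, q_i)  # j's observation, with an independent error
```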

Action preferences and estimated values
For an individual i interacting with j at time t, l_ijt denotes the preference for A. The probability that i uses A is then

\[ p_{ijt} = \frac{1}{1 + \exp(-l_{ijt})}, \tag{S2} \]

so that the preference l_ijt is the logit of the probability of using A. The model uses a linear (intercept and slope) representation of the effect of ξ_ijt on the preference, and expresses l_ijt as the sum of three components:

\[ l_{ijt} = h_{iit} + h_{ijt} + \gamma_{0i}\,\xi_{ijt}. \tag{S3} \]

Here h_iit = f_i θ_iit is a contribution from generalisation of learning from all interactions, h_ijt = (1 − f_i) θ_ijt is a contribution specifically from learning from interactions with a particular opponent j, and γ_0i ξ_ijt is a contribution from the current observation of relative fighting ability. Note that for f_i = 0 the learning about each opponent is entirely separate, with no generalisation between opponents, and for f_i = 1 the intercept component of the action preference is the same for all opponents, so that effectively there is no individual recognition (although the observations ξ_ijt could still differ between opponents). One can similarly write the estimated value v̂_ijt of an interaction as a sum of three components:

\[ \hat v_{ijt} = f_i w_{iit} + (1 - f_i) w_{ijt} + g_{0i}\,\xi_{ijt}. \tag{S4} \]

The actor-critic method updates θ_iit, θ_ijt, w_iit, and w_ijt in these expressions based on perceived rewards, whereas f_i, γ_0i, and g_0i are genetically determined.
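The following sketch computes the action probability and the estimated value from equations (S2)–(S4); the variable names mirror the symbols in the text and any numbers used are placeholders.

```python
import numpy as np

def action_probability(theta_ii, theta_ij, gamma0, xi, f):
    """Probability of the aggressive action A, equations (S2)-(S3)."""
    l = f * theta_ii + (1.0 - f) * theta_ij + gamma0 * xi  # preference (logit)
    return 1.0 / (1.0 + np.exp(-l))

def estimated_value(w_ii, w_ij, g0, xi, f):
    """Estimated value of the interaction, equation (S4)."""
    return f * w_ii + (1.0 - f) * w_ij + g0 * xi
```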

Exploration in learning
For learning to be efficient over longer time spans there must be exploration (variation in actions), so that beneficial actions can be discovered. Learning algorithms, including the actor-critic method, might not provide sufficient exploration (Sutton and Barto 2018), because learning tends to respond to short-term rewards. In the model, exploration is implemented as follows: if the probability in equation (S2) is less than 0.01 or greater than 0.99, the actual choice probability is capped at these limits, i.e. it is set to 0.01 or 0.99, respectively. In principle the degree of exploration could be genetically determined and evolve to an optimum value, but for simplicity this is not implemented in the model.
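A one-line sketch of this capping, with the 0.01/0.99 limits from the text:

```python
def explore_clip(p, p_min=0.01, p_max=0.99):
    """Keep the choice probability away from 0 and 1 to maintain exploration."""
    return min(max(p, p_min), p_max)
```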

Fighting damage and effective fighting ability
A group member i accumulates damage D_it from fighting; D_it refers to the damage accumulated up to (but not including) round t. As a consequence of the damage, the individual's effective fighting ability is reduced from the original q_i to a lower value q̃_it, with the parameter c_0 scaling how strongly accumulated damage reduces effective fighting ability. Following an AA round between i and j, there is an increment to D_it, and similarly for j. The effective fighting abilities also determine the perceived costs (see below), and in this way they influence the learning.
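Purely as an illustration, here is a sketch of damage accumulation; both the linear reduction of effective ability and the logistic damage increment are assumed forms, since the text above specifies only that effective ability declines with damage via c_0 and that weaker individuals are at a disadvantage.

```python
import numpy as np

def effective_ability(q, D, c0):
    """Effective fighting ability reduced by accumulated damage.
    The linear form q - c0*D is an assumed illustration, not the
    model's stated expression."""
    return q - c0 * D

def damage_increment(q_eff_self, q_eff_other, d0=0.1):
    """Damage to self from one AA round. The logistic dependence on the
    difference in effective abilities (weaker individuals take more damage)
    is a hypothetical form; d0 is a placeholder scale parameter."""
    return d0 / (1.0 + np.exp(q_eff_self - q_eff_other))
```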

Perceived rewards
An SS interaction is assumed to have zero rewards, R_ijt = R_jit = 0. For an AS interaction, the aggressive individual i perceives a reward R_ijt = v_i, which is genetically determined and can evolve; the perceived reward for the submissive individual j is zero, R_jit = 0, and vice versa for SA interactions. If both individuals use A, some form of costly interaction or fight occurs, with perceived costs (negative rewards or penalties) that are influenced by the effective fighting abilities of the two individuals. The perceived rewards of an AA interaction are thus penalties that depend on the difference in effective fighting ability, q̃_it − q̃_jt, plus a random influence e_ijt on the perceived penalty, normally distributed with mean zero and standard deviation σ_p (and similarly e_jit for j).
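A sketch of how perceived rewards for one round might be assigned; the AA penalty function is a hypothetical stand-in (the text specifies only that it depends on the effective fighting abilities), and the values of v_self, c, and sigma_p are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

def perceived_reward(action_self, action_other, v_self,
                     q_eff_self, q_eff_other, c=1.0, sigma_p=0.5):
    """Perceived reward for one round of interaction.
    The AA penalty -c*exp(-(q_eff_self - q_eff_other)) is an assumed
    illustrative form; only its dependence on effective abilities is
    given in the text."""
    if action_self == "S":
        return 0.0                       # submissive: zero reward (SS, SA)
    if action_other == "S":
        return v_self                    # AS: aggressor's perceived reward v_i
    penalty = -c * np.exp(-(q_eff_self - q_eff_other))
    return penalty + rng.normal(0.0, sigma_p)  # AA: noisy perceived penalty
```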

Learning updates
In actor-critic learning, an individual updates its learning parameters based on the prediction error (TD error)

\[ \delta_{ijt} = R_{ijt} - \hat v_{ijt}, \]

which is the difference between the actual perceived reward R_ijt and the estimated value v̂_ijt. The learning updates for the θ parameters are given by

\[ \theta_{ii,t+1} = \theta_{iit} + \alpha_{\theta i}\,\delta_{ijt}\,\zeta_{ijt}\,f_i, \qquad \theta_{ij,t+1} = \theta_{ijt} + \alpha_{\theta i}\,\delta_{ijt}\,\zeta_{ijt}\,(1 - f_i), \tag{S9} \]

where

\[ \zeta_{ijt} = \begin{cases} 1 - p_{ijt} & \text{if } i \text{ used action } A, \\ -p_{ijt} & \text{if } i \text{ used action } S, \end{cases} \tag{S10} \]

is referred to as a policy-gradient factor and α_θi is the preference learning rate for individual i. Note that ζ_ijt will be small if p_ijt is close to one and individual i performed action A, which slows down learning, with a corresponding slowing down if p_ijt is close to zero and S is chosen. There are also learning updates for the w parameters, given by

\[ w_{ii,t+1} = w_{iit} + \alpha_{wi}\,\delta_{ijt}\,f_i, \qquad w_{ij,t+1} = w_{ijt} + \alpha_{wi}\,\delta_{ijt}\,(1 - f_i), \tag{S11} \]

where α_wi is the value learning rate for individual i. The updates to the policy parameters θ can be described using derivatives of the logarithm of the probability of choosing an action with respect to the parameters. Using equation (S2), we obtain

\[ \frac{\partial \log \Pr(A)}{\partial l_{ijt}} = 1 - p_{ijt}, \qquad \frac{\partial \log \Pr(S)}{\partial l_{ijt}} = -p_{ijt} \]

for the derivative of the logarithm of the probability of choosing an action, A or S, with respect to the preference for A, which corresponds to equation (S10). From equation (S3) it follows that

\[ \frac{\partial l_{ijt}}{\partial \theta_{iit}} = f_i, \qquad \frac{\partial l_{ijt}}{\partial \theta_{ijt}} = 1 - f_i, \]

and this gives the learning updates of the θ parameters in equation (S9). The updates of the w parameters of the value function can also be described using derivatives. From equation (S4) it follows that

\[ \frac{\partial \hat v_{ijt}}{\partial w_{iit}} = f_i, \qquad \frac{\partial \hat v_{ijt}}{\partial w_{ijt}} = 1 - f_i, \]

and this gives the learning updates of the w parameters in equation (S11).
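A compact sketch of one learning update, combining equations (S2)–(S4) and (S9)–(S11); the default parameter values are placeholders, not the evolved values from the simulations.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Learner:
    theta_ii: float = 0.0     # generalised actor parameter
    theta_ij: float = 0.0     # opponent-specific actor parameter
    w_ii: float = 0.0         # generalised critic parameter
    w_ij: float = 0.0         # opponent-specific critic parameter
    f: float = 0.5            # degree of generalisation (genetic)
    gamma0: float = 1.0       # observation slope in preference (genetic)
    g0: float = 1.0           # observation slope in value (genetic)
    alpha_theta: float = 0.2  # preference learning rate (genetic)
    alpha_w: float = 0.2      # value learning rate (genetic)

    def update(self, xi, used_A, reward):
        """One actor-critic update after a round with observation xi."""
        l = self.f * self.theta_ii + (1 - self.f) * self.theta_ij + self.gamma0 * xi
        p = 1.0 / (1.0 + np.exp(-l))                                  # eq. (S2)-(S3)
        v_hat = self.f * self.w_ii + (1 - self.f) * self.w_ij + self.g0 * xi  # (S4)
        delta = reward - v_hat                                        # TD error
        zeta = (1.0 - p) if used_A else -p                            # eq. (S10)
        self.theta_ii += self.alpha_theta * delta * zeta * self.f         # eq. (S9)
        self.theta_ij += self.alpha_theta * delta * zeta * (1 - self.f)   # eq. (S9)
        self.w_ii += self.alpha_w * delta * self.f                        # eq. (S11)
        self.w_ij += self.alpha_w * delta * (1 - self.f)                  # eq. (S11)
```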

Bystander updates
As in Leimar (2021), bystander effects are modelled as observational learning. When there is a dominance interaction in a group, individuals other than the interacting pair i and j, for instance an individual k, can use the outcome to update their learning parameters. Assume that individual k only performs this updating if i and j end their interaction by using AS or SA (because there is no clear 'winner' in AA and SS interactions, and bystanders do not perceive the costs of AA interactions). The probabilities for individuals i and j to use A are p_ijt and p_jit, from equation (S2). These are 'true' values and are not known by individual k. However, given that the outcome is either AS or SA, one readily derives that the logit of the probability that it is AS is

\[ \log \frac{p_{ijt}(1 - p_{jit})}{(1 - p_{ijt})\,p_{jit}} = l_{ijt} - l_{jit}. \]

From equation (S3) one can see that this involves various learning parameters for i and j. For bystander learning an assumption is needed about how an individual k represents this logit. A simple assumption is that k represents the logit as

\[ \theta_{kjt} - \theta_{kit}, \]

which entails that k does not use any information about q_i or q_j. The assumption is reasonable in that a large θ_kit means that k behaves as if individual i is weak, and similarly for θ_kjt. Using the notation

\[ P_{kt} = \frac{1}{1 + \exp\!\bigl(-(\theta_{kjt} - \theta_{kit})\bigr)} \]

for k's implied probability that the outcome is AS, the bystander updates by k are assumed to be

\[ \theta_{ki,t+1} = \theta_{kit} - \beta_k (1 - P_{kt}), \qquad \theta_{kj,t+1} = \theta_{kjt} + \beta_k (1 - P_{kt}) \]

if i wins (outcome is AS), and

\[ \theta_{ki,t+1} = \theta_{kit} + \beta_k P_{kt}, \qquad \theta_{kj,t+1} = \theta_{kjt} - \beta_k P_{kt} \]

if j wins (outcome is SA). The parameter β_k is a measure of how salient or significant a bystander observation is for individual k, and this parameter is assumed to be genetically determined and can evolve. These bystander updates are similar to the direct-learning updates of the actor component of the actor-critic model and were used in Leimar (2021). There is also the possibility that the salience for a bystander of a contest outcome is influenced by additional information the bystander might have, either from current or from previous observations. For instance, if the observations in equation (S1) are about relative size, a bystander might have estimates ξ_ki and ξ_kj of its own size in relation to the contestants i and j. Instead of the bystander updates above, the updates can then be weighted by salience factors that depend on these estimates: stronger weighting when ξ_ki indicates that a winning i is bigger than k, and when ξ_kj indicates that a losing j is smaller than k (and correspondingly for the SA outcome). These updates entail that a bystander k pays particular attention to wins by an individual perceived to be bigger than itself, and to losses by an individual perceived to be smaller. We used these updates in our simulations.
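A sketch of the bystander update, following the simple (unweighted) version above; the logistic `attention` factor applied to the size estimates is a hypothetical form, used only to illustrate the weighting described in the text.

```python
import numpy as np

def bystander_update(theta_ki, theta_kj, beta_k, i_won, xi_winner=None):
    """Observational-learning update by bystander k after an AS or SA
    outcome between i and j. xi_winner is k's size estimate of itself
    relative to the winner; the logistic weighting by this estimate is
    an assumed illustration of the attention effect in the text."""
    P_k = 1.0 / (1.0 + np.exp(-(theta_kj - theta_ki)))  # k's implied P(i wins)
    step = beta_k * (1.0 - P_k) if i_won else beta_k * P_k
    if xi_winner is not None:
        # More attention when the winner is perceived to be bigger than k
        # (negative xi_winner), a hypothetical salience factor.
        step *= 1.0 / (1.0 + np.exp(xi_winner))
    if i_won:
        return theta_ki - step, theta_kj + step  # i looks strong, j looks weak
    return theta_ki + step, theta_kj - step      # j looks strong, i looks weak
```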

Life-history and reproductive season
There is an annual life cycle with a single reproductive season. Dominance interactions occur in groups of size g_s, with g_s = 8 for the individual-based simulations in Table S1. The season starts with a sequence of contests. Each contest is between a randomly selected pair of group members, and there are 5g_s(g_s − 1) contests, i.e., on average 10 contests per pair. An individual's survival from the contests to reproduction depends on its accumulated damage. As a result of the contests, a dominance hierarchy is formed, and surviving group members acquire reproductive success (RS) according to their ranks. The purpose of this scheme is to implement a combination of hierarchy formation, resource acquisition, and mortality over the season in a way that allows both fitness benefits and costs to influence trait evolution. In principle, similar results could be achieved by, for instance, implementing a risk of mortality after each contest, or even after each round of interaction.
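To make the contest count concrete, for g_s = 8 the number of unordered pairs and the resulting contests per pair work out as

\[ \binom{g_s}{2} = \frac{8 \cdot 7}{2} = 28 \ \text{pairs}, \qquad 5 g_s (g_s - 1) = 5 \cdot 8 \cdot 7 = 280 \ \text{contests}, \qquad \frac{280}{28} = 10 \ \text{contests per pair}. \]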

Contests
If a dominance relation has already been established between contestants i and j, there is no interaction. If not, the contestants go through a number of rounds of interaction, at minimum 10 and at maximum 200. If there are 5 successive rounds where i uses A and j uses S (5 AS rounds), the contest ends and i is considered dominant over j, and vice versa if there are 5 successive SA rounds. Further, the contest ends in a draw if there are 5 successive SS rounds.
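A sketch of this contest-termination logic, interpreting the 10-round minimum as no termination check before round 10; `choose_actions` is a hypothetical stand-in for the learning model above.

```python
def run_contest(choose_actions, min_rounds=10, max_rounds=200, streak=5):
    """Run rounds until a stopping rule from the text is met. Returns
    'i' or 'j' (the dominant individual), 'draw', or None if no rule
    triggers within max_rounds."""
    history = []
    for t in range(1, max_rounds + 1):
        history.append(choose_actions(t))  # e.g. ('A', 'S') for one round
        if t < min_rounds:
            continue
        last = history[-streak:]
        if all(r == ('A', 'S') for r in last):
            return 'i'     # i dominant after 5 successive AS rounds
        if all(r == ('S', 'A') for r in last):
            return 'j'     # j dominant after 5 successive SA rounds
        if all(r == ('S', 'S') for r in last):
            return 'draw'  # draw after 5 successive SS rounds
    return None            # no resolution within max_rounds
```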

Mortality from fighting damage
An individual with accumulated damage D_it survives from the contests to reproduction with a probability that decreases with D_it, at a rate set by the mortality-cost parameter c_1.
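The explicit expression for this survival probability is not reproduced here. Purely as an assumed illustration, a minimal form consistent with the description (survival declining with accumulated damage, governed by c_1) would be

\[ s(D_{it}) = \exp(-c_1 D_{it}), \]

where a higher mortality cost, such as the value c_1 = 0.002 used for Table S2, makes survival fall off faster with damage.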

Dominance ranking and reproductive success
The ranking is among surviving individuals and is based on how many other group members an individual dominates (this measure is referred to as a score structure by Landau (1951)). If some individuals dominate the same number of other group members, their relative rank is randomly determined (this happened occasionally in our simulations). As an extreme example, if all individuals were to use action S in the contests, there would be no real dominance hierarchy, because each would dominate zero other group members, and all ranks would be randomly determined (this never happened in our simulations). Surviving group members acquire reproductive success. A local group, containing 8 interacting individuals (if all survive) and 8 of the other sex, produces an expected number of 16 offspring. For each offspring, one parent of the interacting sex is drawn from the group with a probability proportional to ρV(k) + (1 − ρ) (Fig. 1a), where k is the rank, and a parent of the other sex is drawn at random. For instance, if a linear hierarchy has been established and all survive, an individual with rank k (with k = 1 the top rank) obtains an expected RS of ρV(k) + (1 − ρ). In the next generation, each offspring disperses to a random local group. In this way, interacting individuals are unrelated.
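A sketch of the parent-sampling step; `V` here is a hypothetical decreasing rank-value function standing in for the V(k) of Fig. 1a, so only the sampling scheme itself follows the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_parents(ranks, rho, n_offspring=16):
    """Draw one interacting-sex parent per offspring, with probability
    proportional to rho*V(k) + (1 - rho), where k = 1 is the top rank.
    V is a placeholder (geometric decline); the model's V(k) is given
    in Fig. 1a."""
    V = lambda k: 2.0 ** (1 - k)
    weights = np.array([rho * V(k) + (1.0 - rho) for k in ranks])
    return rng.choice(ranks, size=n_offspring, p=weights / weights.sum())

# Example: full monopolisation (rho = 1) versus no skew (rho = 0) among 8 survivors.
ranks = np.arange(1, 9)
print(np.bincount(sample_parents(ranks, rho=1.0), minlength=9)[1:])
print(np.bincount(sample_parents(ranks, rho=0.0), minlength=9)[1:])
```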

Elo rating
Several approaches to Elo ratings have been used, differing in such things as the zero point of the scale and the amount by which ratings change after a 'win' by one individual over another, or after a 'draw'. There is a similarity between updates of Elo ratings and the updates of action preferences for actor-critic learning described above. Here, however, we use the Elo rating simply as a conventional measure or index of dominance rank, without further interpretation of what the scores might mean; the possible usefulness of the measure needs instead to be investigated. From our results here, Elo ratings appear useful in providing a description of a dominance hierarchy.
Let E_it be the Elo rating of group member i at time t. Initially all have rating E_i0 = 0. If a contest between i and j ends with i becoming dominant over j, E_it is incremented by

\[ \kappa \,(1 - P_{ijt}), \qquad \text{where} \qquad P_{ijt} = \frac{1}{1 + \exp\!\bigl(-(E_{it} - E_{jt})\bigr)} \]

and κ sets the size of rating changes. The Elo rating of j, E_jt, is decremented by the same amount. If the contest ends in a draw, E_it is decremented by

\[ \kappa \left( P_{ijt} - \tfrac{1}{2} \right) \]

and E_jt is incremented by this amount. It can help the interpretation to think of P_ijt as a probability, formed before the interaction, for the outcome ('win', 'loss', or 'draw'). This, however, is just an interpretation that helps explain why Elo ratings are defined in a certain way. Dominance relations differ from wins and losses in a tournament, so it is not certain that Elo ratings are useful for predicting outcomes of dominance interactions. One can, of course, investigate the usefulness for each particular case.
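A sketch of these rating updates; the value of the update-scale parameter kappa is a placeholder.

```python
import math

def elo_update(E_i, E_j, outcome, kappa=1.0):
    """Update Elo ratings after a contest. outcome is 'i' (i dominates j),
    'j' (j dominates i), or 'draw'. P_ij is the logistic 'win' probability
    implied by the current ratings."""
    P_ij = 1.0 / (1.0 + math.exp(-(E_i - E_j)))
    if outcome == 'i':
        delta = kappa * (1.0 - P_ij)   # i gains, j loses the same amount
        return E_i + delta, E_j - delta
    if outcome == 'j':
        delta = kappa * P_ij           # symmetric: j gains what i loses
        return E_i - delta, E_j + delta
    delta = kappa * (P_ij - 0.5)       # draw: the higher-rated side loses points
    return E_i - delta, E_j + delta
```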
Table S2: Same as Table S1, but for cases with no cost of loss of fighting ability (c_0 = 0) and higher mortality cost (c_1 = 0.002). Reproductive parameters, mean survival, and trait values (mean ± SD over 100 simulations, each over 5000 generations) for 12 different cases of individual-based evolutionary simulations of social dominance interactions.

The dashed grey curves show the effect of instead changing the nature of fighting damage, by eliminating the loss of effective fighting ability (i.e. setting c_0 = 0). Note that these changes are imposed on the situation of case 1 in Table S1, and do not represent evolutionary changes to different conditions.