Attention to Entropic Communication

The concept of attention has proven to be very relevant in artificial intelligence. Relative entropy (RE, aka Kullback-Leibler divergence) plays a central role in communication theory. Here, these concepts, attention and RE, are combined. RE guides optimal encoding of messages in bandwidth-limited communication as well as optimal message decoding via the maximum entropy principle. In the coding scenario, RE can be derived from four requirements, namely being analytical, local, proper, and calibrated. Weighted RE, used for attention steering in communications, turns out to be improper. To see how proper attention communication can emerge, a scenario is analyzed in which a message sender wants to ensure that the receiver of the message can perform well-informed actions. In case only the curvature of the utility function at its maxima is known, it becomes desirable to accurately communicate an attention function, in this case a probability function weighted by this curvature and re-normalized. Entropic attention communication is here proposed as the desired generalization of entropic communication that permits weighting while being proper, thereby aiding the design of optimal communication protocols in technical applications and helping to understand human communication. It provides the level of cooperation expected under misaligned interests of otherwise honest communication partners.

The relative entropy between two probability densities P(s|I_A) and P(s|I_B) on an unknown signal or situation s ∈ S,

D_s(I_A, I_B) := ∫_S ds P(s|I_A) ln [P(s|I_A)/P(s|I_B)],   (1)

measures the amount of information in nits lost by degrading some knowledge I_A to knowledge I_B. The letters A and B stand for Alice and Bob, who are communication partners. Alice can use the relative entropy to decide which message she wants to send to Bob in order to inform him best. I_A is Alice's background information before and after her communication.
We assume that Alice knows how to communicate such that Bob updates his previous knowledge state I_0 to I_B. The functional form of the relative entropy can be derived from various lines of argumentation [27-32]. As the most natural information measure, relative entropy plays a central role in information theory. It is often used as the quantity to be minimized when deciding which of the possible messages shall be sent through a bandwidth-limited communication channel that does not permit the full transfer of I_A, but also in other circumstances.
As we will discuss in more detail, relative entropy as specified by Eq. 1 is uniquely determined up to a multiplicative factor as the measure to determine the optimal message to be sent under the requirements of it being analytical (all derivatives w.r.t. the parameters in I_B exist everywhere), local (only the s that happens will matter in the end), proper (favoring I_B = I_A) [8, 33, 34], and calibrated (being zero for I_B = I_A). Our derivation is a slight modification of that given in Leike and Enßlin [30].
A number of attempts have been made to introduce weights into the relative entropy [32, 35-40]. Some of these go back to Guiasu [41] and Belis and Guiasu [42]. Most of them can be summarized by the weighted relative entropy,

D_s^{(w)}(I_A, I_B) := ∫_S ds w(s) P(s|I_A) ln [P(s|I_A)/P(s|I_B)],   (2)

with weights w(s), for which w(s) ≥ 0 holds for all s ∈ S.
The extension of relative entropy to weighted relative entropy, as given by Eq. 2, appears attractive, as it can reflect scenarios in which not all possibilities in S are equally important. For example, detailed knowledge on the subset of situations S^B_dead ⊂ S in which Bob's decisions do not matter to him is not very relevant for the communication, as he cannot gain much practical use from it. Therefore, Alice should not waste valuable communication bandwidth on details within S^B_dead, but use it to inform him about the remaining situations S^B_alive = S \ S^B_dead for which being well informed makes a difference to Bob. However, despite being well motivated, weighted relative entropy is not proper in the mathematical sense for nonconstant weighting functions, as we will show and as was recognized before [32].
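The impropriety of weighted relative entropy can be checked directly: on a discrete situation space, the normalized distribution minimizing Eq. 2 for fixed P(s|I_A) is proportional to w(s) P(s|I_A), not P(s|I_A) itself. A minimal numeric sketch (the distributions and weights below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete stand-ins for P(s|I_A) and a nonconstant weight w(s).
p = rng.random(50); p /= p.sum()          # "Alice's" knowledge P(s|I_A)
w = 1.0 + np.arange(50) / 50.0            # nonconstant weights w(s) > 0

def weighted_re(p, q, w):
    """Weighted relative entropy D^(w)(p, q) = sum_s w p ln(p/q) (Eq. 2, discretized)."""
    return np.sum(w * p * np.log(p / q))

# The unconstrained minimizer of D^(w)(p, .) over normalized q is q* ~ w p,
# which differs from p itself whenever w is nonconstant:
q_star = w * p / np.sum(w * p)

assert weighted_re(p, q_star, w) < weighted_re(p, p, w)  # q* beats honesty: improper
assert not np.allclose(q_star, p)
```

The first assertion shows that reporting the distorted q* scores strictly better than reporting p itself, which is exactly the failure of properness discussed above.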

Relative Attention Entropy
Given the success of attention weighting in artificial intelligence research [43-45], in particular in transformer networks [45], the question arises whether a form of weighted relative entropy exists that is proper. In order to understand how the weighting should be included, we investigate a specific communication scenario in which weighted and re-normalized probabilities of the form

A^{(w)}(s|I) := w(s) P(s|I) / ∫_S ds' w(s') P(s'|I)   (3)

appear naturally as the central element of communication.
We will call a quantity of this form attention function, attention density function, or briefly attention² when there is no risk of confusion with other attention concepts (as within this paper).
The corresponding relative attention entropy,

D_s^{(A)}(I_A, I_B) := ∫_S ds A^{(w)}(s|I_A) ln [A^{(w)}(s|I_A)/A^{(w)}(s|I_B)],   (4)

leads to proper communication in case w(s) > 0 for all s ∈ S, as we show in the following. The minimum of the relative attention entropy is given by A^{(w)}(s|I_B) = A^{(w)}(s|I_A), which in case w(s) > 0 for all s ∈ S implies P(s|I_B) = P(s|I_A), since

P(s|I_B) = [A^{(w)}(s|I_B)/w(s)] / ∫_S ds' [A^{(w)}(s'|I_B)/w(s')]
         = [A^{(w)}(s|I_A)/w(s)] / ∫_S ds' [A^{(w)}(s'|I_A)/w(s')]
         = [P(s|I_A)/∫_S ds'' w(s'') P(s''|I_A)] / ∫_S ds' [P(s'|I_A)/∫_S ds''' w(s''') P(s'''|I_A)]
         = P(s|I_A).   (5)

Here we first turned the attention back into a probability by inverse weighting and re-normalization, then substituted A^{(w)}(s|I_B) by A^{(w)}(s|I_A) thanks to their identity, further substituted the latter by its definition in terms of P(s|I_A), and finally used the normalization ∫_S ds P(s|I_A) = 1.

² The term "attention" seems appropriate for this quantity: Attention is the concentration of awareness on some phenomenon to the exclusion of other stimuli [46]. It is a process of selectively concentrating on a discrete aspect of information, whether considered subjective or objective [47]. If "awareness on some phenomenon" can be read as the probability associated with it, the weighting done in our construction of attention then concentrates the awareness on relevant information.
The relative attention entropy differs from the weighted relative entropy (Eq. 2) due to the re-normalization in the definition of attention. This leads to an irrelevant re-scaling of the weighted relative entropy by 1/∫_S ds w(s) P(s|I_A), which is independent of I_B, but also to a relevant extra term that does depend on I_B:

D_s^{(A)}(I_A, I_B) = D_s^{(w)}(I_A, I_B) / ∫_S ds w(s) P(s|I_A) − ln [∫_S ds w(s) P(s|I_A) / ∫_S ds w(s) P(s|I_B)].   (6)

This extra term ensures properness when the attention entropy is minimized w.r.t. I_B. We refrained here from giving similar integration variables different names.
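The decomposition of the relative attention entropy into a re-scaled weighted relative entropy plus an I_B-dependent logarithmic term can be verified numerically on a discrete situation space (the distributions and weights below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(40); p /= p.sum()   # P(s|I_A)
q = rng.random(40); q /= q.sum()   # P(s|I_B)
w = rng.random(40) + 0.1           # weights w(s) > 0

def attention(w, p):
    return w * p / np.sum(w * p)            # A^(w)(s|I), Eq. 3, discretized

def kl(a, b):
    return np.sum(a * np.log(a / b))

d_att = kl(attention(w, p), attention(w, q))   # relative attention entropy, Eq. 4
d_w   = np.sum(w * p * np.log(p / q))          # weighted relative entropy, Eq. 2
z_a, z_b = np.sum(w * p), np.sum(w * q)        # the two attention normalizations

# D^(A) = D^(w)/Z_A - ln(Z_A/Z_B): the I_B-dependent term restores properness.
assert np.isclose(d_att, d_w / z_a - np.log(z_a / z_b))
```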
We investigate the specific scenario of Alice wanting to inform Bob optimally. From this, we will motivate attention according to Eq. 4. This means that she wants to prepare Bob such that he can decide about an action a ∈ A in a way that is optimal for him. This action has an outcome for Bob that depends on the unknown situation s Alice tries to inform Bob about. The outcome is described by a utility function u(s, a), which both Alice and Bob want to maximize. The utility depends on both the unknown situation s and Bob's action a. In the following, we assume u(s, a) to be at least twice differentiable w.r.t. a and to exhibit only one maximum for any given s. In case Alice does not know Bob's utility function, but does know its curvature w.r.t. a at its maximum for any given s, it will turn out that Alice wants to communicate her attention A^{(w)}(s|I_A) as accurately as possible to Bob, where w(s) is the curvature of the utility w.r.t. a. In short, we will show that Alice should inform Bob most precisely about situations in which Bob's decision requires the largest accuracy, and not at all about situations in which his actions do not make any difference. The functional according to which Alice will fit I_B to I_A will not be the relative attention entropy of Eq. 4, but a different one. However, it motivates attention as an essential element of utility-aware communication and shows the path to extend the derivation of relative entropy to that of relative attention entropy.

Attention Example
To illustrate how relative attention entropy works in practice, we investigate a concrete example. For this we assume that Alice has a bimodal knowledge state on a situation s about which she wants to inform Bob, a superposition of two Gaussians centered at s = ±1 with variance σ_A². This is shown in Fig. 1.
Let us assume that Alice believes that the different situations have importance weights w(s) = e^{λs} for Bob, with λ controlling the inhomogeneity of the weights. We further assume that her communication can only create Gaussian knowledge states in Bob, parametrized by a mean m and a variance σ_B². The corresponding attentions of Alice and Bob are given in App. A, which also covers the details of the following calculations. These functions are displayed in Fig. 2 for smaller values of λ.
Matching the parameters m and σ_B² by minimizing the relative attention entropy of Eq. 4 w.r.t. them yields the results derived in App. A. The resulting communicated attention-weighted knowledge is depicted in Fig. 1 for various values of λ. The case λ = 0 corresponds to homogeneous weights and therefore to using the non-weighted relative entropy, Eq. 1. In this case Alice communicates a broad knowledge state to Bob that covers both peaks of her knowledge. With increasing λ the right peak becomes more important and Alice puts more and more attention on communicating its location and shape accurately. In the limit λ → ∞ we have m = 1 and σ_B² = σ_A², meaning that Alice communicates exactly the relevant peak at s = +1 and completely ignores the one at s = −1, which is irrelevant for Bob.
Minimizing the weighted relative entropy of Eq. 2 instead yields a different Gaussian for P(s|I_B), which results in poorly adapted communicated knowledge states, as also depicted in Fig. 1. Weighted relative entropy just moves the broad peak, centered originally at zero for both entropies in the case of no weighting (λ = 0), to increasingly more extreme locations, which for λ → ∞ become completely detached from the location of the relevant peak. This detachment is clearly visible in Figs. 1 and 2.
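A numeric sketch of this example (the peak width, λ, and search grids below are illustrative assumptions) reproduces the qualitative behavior: the attention-entropy fit locks onto the relevant peak at s = +1, while the weighted-relative-entropy fit drifts to a detached, more extreme location:

```python
import numpy as np

# Alice's bimodal knowledge: two Gaussians at s = -1 and s = +1 (width assumed).
s = np.linspace(-4.0, 8.0, 3001)
ds = s[1] - s[0]
gauss = lambda m, v: np.exp(-(s - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
sig_a = 0.3
p_a = 0.5 * gauss(-1.0, sig_a**2) + 0.5 * gauss(1.0, sig_a**2)
lam = 4.0
w = np.exp(lam * s)                           # importance weights w(s) = e^(lam s)

def attention(dens):                          # A^(w)(s|I), Eq. 3, discretized
    a = w * dens
    return a / (a.sum() * ds)

def fit_gaussian(loss):                       # brute-force search over (m, sigma_B)
    return min(((loss(gauss(m, sig**2)), m)
                for m in np.arange(0.0, 2.5, 0.05)
                for sig in np.arange(0.1, 1.2, 0.05)))[1]

eps = 1e-300
att_a = attention(p_a)
m_att = fit_gaussian(lambda q: np.sum(
    att_a * np.log((att_a + eps) / (attention(q) + eps))) * ds)   # Eq. 4
m_wre = fit_gaussian(lambda q: np.sum(
    w * p_a * np.log((p_a + eps) / (q + eps))) * ds)              # Eq. 2

assert abs(m_att - 1.0) < 0.2     # attention entropy: Bob's Gaussian sits on the peak
assert m_wre > m_att + 0.15       # weighted RE: detached, more extreme location
```

Pushing λ higher makes the contrast stronger, in line with the λ → ∞ limits stated above.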
In order to see how the process of Alice informing Bob works in detail, we have to understand how he decodes messages, as this defines the format of the messages Alice can send. The best way for him to incorporate knowledge sent to him into his own beliefs is by using the maximum entropy principle (MEP), which we assume he does in the following.

Maximum Entropy Principle
The MEP was derived as a device to optimally decode messages and to incorporate their information into preexisting knowledge [2-6, 48]. The MEP states that among all possible updated probability distributions that are consistent with the message, the one with the largest entropy should be taken. Requiring that this update should be local, reparametrization invariant (w.r.t. the way the signal or situation s ∈ S is expressed), and separable (w.r.t. handling independent dimensions of s) enforces the functional form of this entropy (up to affine transformations) to be

S(I_B, I_0) = −∫_S ds P(s|I_B) ln [P(s|I_B)/P(s|I_0)],

with I_0 being Bob's knowledge before and I_B after the update. We assume that Alice's message takes the form

"⟨f(s)⟩_(s|I_A) = d".   (17)

We call the function f : S → R^n the topic of the communication, as it expresses the specific aspects of s that Alice's message is about. E.g., in case Alice wants to inform Bob about the first moment of the first component of s ∈ R^n, the topic would be f(s) = s_1. As Alice communicates it, the topic is known to both Alice and Bob. Here, ⟨f(s)⟩_(s|I_A) := ∫_S ds f(s) P(s|I_A) is a compact notation for evaluating this topic as an expectation value. The communicated expectation value of f is called in the following the message data d. The word data expresses that a quantity can be regarded as a certain fact, which still might have an uncertain interpretation; in this case it is certain to Bob that d is Alice's expectation for the topic. Updating his knowledge according to the MEP after receiving the message implies that Bob extremizes the constrained entropy

S(I_B, I_0) + µ^t (⟨f(s)⟩_(s|I_B) − d)

w.r.t. the arguments I_B and µ. Here, the latter is a Lagrange multiplier that ensures that Alice's statement, Eq. 17, is imprinted onto Bob's knowledge P(s|I_B). The MEP and the requirement of normalized probabilities then imply that Bob's updated knowledge is of the form

P(s|I_B) = P(s|I_0) exp(µ^t f(s)) / Z(µ, f), with Z(µ, f) := ∫_S ds P(s|I_0) exp(µ^t f(s)).
The Lagrange multiplier µ needs to be chosen such that ⟨f(s)⟩_(s|I_B) = d, which can be achieved by requiring ∂ ln Z(µ, f)/∂µ = d, since ⟨f(s)⟩_(s|I_B) = ∂ ln Z(µ, f)/∂µ. Thus, the MEP procedure ensures that the communicated moment of Alice's knowledge state, ⟨f(s)⟩_(s|I_A) = d, gets transferred accurately into Bob's knowledge, and thus that Bob extracts all information Alice has sent to him. Alice's communicated expectation for the topic, d = ⟨f(s)⟩_(s|I_A), is now "imprinted" onto Bob's knowledge, as ⟨f(s)⟩_(s|I_B) = d holds as well after his update.
If Bob decodes Alice's message via the MEP, she can send a perfectly accurate image of her knowledge if the communication channel bandwidth permits this. Detailed explanations can be found in App. C. Otherwise, she needs to compromise and for this requires a criterion on how to do so.
The protocol of the entropy-based communication between Alice and Bob therefore looks as follows:
Both know: The situation/signal s is in a set S of possibilities, on which they have information I_A and I_0, respectively, implying for each a knowledge state P(s|I_A) and P(s|I_0), respectively.
Alice sends: A function f : S → R n , the message topic, and her expectation for it, d = ⟨f (s)⟩ P (s|I A ) , the message data.
Bob updates: I 0 → I B , according to the MEP, such that his updated knowledge P(s|I B ) has the same expectation value w.r.t. the topic, d = ⟨f (s)⟩ P (s|I B ) .
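The three protocol steps can be sketched on a discretized situation space; the knowledge states and the topic f(s) = s below are illustrative assumptions, and the MEP update solves for µ by bisection:

```python
import numpy as np

# Discretized situation space and the two knowledge states (illustrative choices).
s = np.linspace(-5.0, 5.0, 2001)
ds = s[1] - s[0]
norm = lambda dens: dens / (dens.sum() * ds)
p_alice = norm(np.exp(-(s - 1.0) ** 2 / 0.5))   # P(s|I_A): Alice, well informed
p_bob0 = norm(np.exp(-(s**2) / 8.0))            # P(s|I_0): Bob's broad prior

# Alice sends: topic f and data d = <f(s)>_(s|I_A); here f(s) = s (first moment).
f = s
d = np.sum(f * p_alice) * ds

# Bob updates via the MEP: P(s|I_B) ~ P(s|I_0) exp(mu f(s)), mu fixed by <f>_B = d.
def mep_update(p0, f, d, lo=-20.0, hi=20.0):
    moment = lambda mu: np.sum(f * norm(p0 * np.exp(mu * f))) * ds
    for _ in range(100):                        # bisection; moment is monotonic in mu
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if moment(mid) < d else (lo, mid)
    return norm(p0 * np.exp(0.5 * (lo + hi) * f))

p_bob = mep_update(p_bob0, f, d)
assert np.isclose(np.sum(f * p_bob) * ds, d, atol=1e-6)  # topic moment imprinted
```

The final assertion is the "imprinting" property: after the update, Bob's expectation for the topic equals the communicated data d.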

Structure of this Work
The remainder of this work is structured as follows. Sect. 2 recapitulates the derivation of relative entropy by Bernardo [27] and Leike and Enßlin [30] and shows that a non-trivially weighted relative entropy is not proper. Sect. 3 then discusses variants of our communication scenario in which Alice's and Bob's utility functions are known to them, be they aligned or misaligned. In Sect. 4 we show that attention emerges as the quantity to be communicated most accurately in case Alice wants the best for Bob, but lacks precise knowledge of Bob's utility except for its curvature w.r.t. Bob's action in any situation. This motivates the introduction of attention to entropic communication. In Sect. 5 relative attention entropy is derived in analogy to the derivation of relative entropy. We conclude in Sect. 6 by discussing the relevance of our analysis in technological and socio-psychological contexts.

Proper Coding
In order to see the relation between properness and relative entropy, we recapitulate the derivation of the latter as given by Leike and Enßlin [30] in a modified way. There, it was postulated that Alice uses a loss function L(s, I_A, I_B) (negative utility) that depends on the situation s that happens in the end, as well as on her and Bob's knowledge at this point, I_A and I_B, respectively. Leike and Enßlin [30] call this function her embarrassment, as it should encode how badly she informed Bob about the situation s that happened in the end. Obviously, Alice wants to minimize this embarrassment.
At the time Alice has to make her choice, she does not know which situation s = s* will happen. She therefore needs to minimize her expected loss when deciding which knowledge state I_B Bob should get (via her message m). Here we discriminate the related, but different, functions labeled by L via their signatures (their sets of arguments). General criteria such a loss function should obey were formulated by Leike and Enßlin [30], which we slightly rephrase here as:
Analytical: Alice's loss should be an analytic expression of its arguments. An analytic function is an infinitely differentiable function that has a converging Taylor series in a neighborhood of any point of its domain. As a consequence, an analytic function is fully determined on all of its (connected) domain by such a Taylor series around any domain position.
Locality: In the end, only the case that happens matters.Without loss of generality, let's assume that s = s * turned out to be the case.Of all statements about s, only her prediction P(s * |I B ) that Alice made about s * before s * turned out to be the case, is relevant for her loss.
Properness: If possible, Alice should favor to transmit her actual knowledge state to Bob, I B = I A .
Calibration: The expected loss of being proper shall be zero.
Locality implies that L(s, I_A, I_B) can only depend on I_A and I_B through q(s) := P(s|I_B) and p(s) := P(s|I_A), meaning L(s, I_A, I_B) = L(s, p(s), q(s)), again using the function signatures to discriminate different L's. To the expected loss we add a Lagrange multiplier term λ (∫_S ds q(s) − 1) to ensure that P(s|I_B) is normalized.
Properness then requests that the expected loss should be minimal for q = p, implying for all possible s = s*

0 = ∂/∂q(s*) [⟨L(s, p(s), q(s))⟩_(s|I_A) + λ (∫_S ds q(s) − 1)]_{q=p} = p(s*) [∂L(s, x, y)/∂y]_{s=s*, x=y=p(s*)} + λ.
From this

∂L(s, x, y)/∂y = −λ/y

follows, which is solved analytically by

L(s, x, y) = −λ ln y + c(s, x),

as can be verified by insertion. The Lagrange multiplier is unspecified and we choose it to be λ = 1, with the positive sign of λ ensuring that L is actually a minimum and the magnitude |λ| = 1 that the units of this loss are nits (λ = 1/ln 2 would set the units to bits or shannons).
Calibration then requests that 0 = L(s, x, x) = −ln x + c(s, x), therefore c(s, x) = ln x, and thus, by reinserting this into the previous expression, we find

L(s, x, y) = ln(x/y),

whose expectation ⟨ln(p(s)/q(s))⟩_(s|I_A) is the relative entropy D_s(I_A, I_B) as defined by Eq. 1. We note that calibration is more of an aesthetic requirement, as Alice's choice is already uniquely determined by any loss function that is local and proper. Calibration, however, makes the loss reparametrization invariant: for any diffeomorphism of S both probability densities change by the same Jacobian factor, which cancels in ln(x/y), as can be verified by a coordinate transformation. Strictly speaking, the properness condition determines the loss only in an infinitesimal environment of y = x. Only thanks to the requirement of the loss being analytic in the full domain of its second argument (added here to the requirement set of Leike and Enßlin [30]) does the solution have to hold at all other locations, and Alice's expected loss becomes uniquely determined to be the relative entropy.
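Properness and calibration of the resulting loss L = ln(p(s)/q(s)) amount to Gibbs' inequality, which a quick numeric check illustrates (the discrete distributions below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random(30); p /= p.sum()          # P(s|I_A)

def expected_loss(q):
    """Expected embarrassment <L(s, p, q)>_p with L = ln(p(s)/q(s))."""
    return np.sum(p * np.log(p / q))

# Properness: any q != p scores worse than honestly reporting q = p;
# calibration: the honest report has expected loss exactly zero.
for _ in range(100):
    q = rng.random(30); q /= q.sum()
    assert expected_loss(q) >= expected_loss(p) >= 0.0
assert abs(expected_loss(p)) < 1e-12
```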

Weighted Coding
Could one insert weights into the relative entropy by inserting them into the above derivation? One could try to do so by modifying the locality requirement, namely by requiring the weighted loss w(s) L(s, p(s), q(s)) to have a minimal expectation value for Alice. Propriety then requires that

∂L(s, x, y)/∂y = −λ/(w(s) y),

as follows from a calculation along the lines of the previous section. This is analytically solved by

L(s, x, y) = ln(x/y)/w(s),

where we directly ensured calibration and set λ = 1. Alice's expected weighted loss is therefore (for normalized q(s) = P(s|I_B))

⟨w(s) L(s, p(s), q(s))⟩_(s|I_A) = ∫_S ds P(s|I_A) ln [P(s|I_A)/P(s|I_B)] = D_s(I_A, I_B),

which is exactly the same unweighted relative entropy as before. Therefore, weights in a relative entropy as a means of deciding on optimal coding are not consistent with our requirements. Since we have modified the locality requirement, and since calibration is not essential, the requirement that prevents weights must be properness.
As weighted relative entropy is not proper [32], we ask whether a different way to introduce weights could be proper. In order to answer this question, we turn to a conceptually simpler setting. For this we introduce the concept of Theory of Mind [49, 50]. This is the representation of a different mind in one's own mind. As this can be applied recursively ("I think that you think that I think ..."), one discriminates Theories of Mind of different orders according to the level of recursion. The above derivation of the relative entropy is based on a second-order Theory of Mind construction, namely that Alice does not want Bob to think badly about her informedness (she worries about his beliefs on her thinking). A first-order Theory of Mind construction, in which Alice only cares about Bob being well informed about what matters for Bob, might be more instructive for understanding how weights might emerge. We now turn to such a scenario.

Communication Scenario
In the following we investigate the scenario sketched in Fig. 3: Alice is relatively well informed about the situation s ∈ S her communication partner Bob will find himself in.In this not perfectly known situation Bob needs to take an action a ∈ A, which is rewarded by a utility u B (s * , a) ∈ R to Bob that depends on the situation s * ∈ S that will eventually occur.Bob's action also implies a utility u A (s * , a) ∈ R to Alice.We will eventually assume that Alice's and Bob's utility functions are aligned, u A (s, a) = u B (s, a) =: u(s, a), and give arguments why this might happen, but for the moment, we keep them separate in order to study the consequences of different interests.
Such misalignments are actually very common.Just imagine that Bob is Alice's young child, a is the amount of sugar that Bob is going to consume, and s is how well his metabolism handles sugar.There is a lot of anecdotal evidence for u A (s, a) ̸ = u B (s, a) under such conditions.
Alice will communicate through a bandwidth limited channel parts of her knowledge to Bob, who is here assumed to trust Alice fully, so that Bob can perform a more informed decision on his action.As stated before, Alice's message takes the form "⟨f (s)⟩ (s|I A ) = d", with f : S → R n , the conversation topic, being a moment function over the set of situations out of a limited set F of such functions she can choose from.
With increasing set size of F, Alice's communication channel becomes more flexible, and with increasing dimension n of the data space the channel bandwidth increases. Her message m to Bob therefore consists of the tuple (f, d) ∈ F × R^n.

[Fig. 3 caption: Alice communicates parts of her knowledge to Bob about an unknown situation. After updating according to Alice's message, Bob chooses his action, which for simplicity is here assumed to be a point in situation space S (for example, the blue point could indicate to which situation his action is best adapted). His action and the unknown situation determine Bob's resulting utility. Bob chooses his action by maximizing his expected utility given his knowledge after Alice informed him (blue equal-probability contours of P(s|I_B) in his mental copy of the situation space). The action and situation also determine a utility for Alice, which may or may not equal Bob's utility. Alice chooses her message such that her expected utility resulting from Bob's action is maximized in the light of her situation knowledge P(s|I_A) (red contours).]

In case she is honest, she can only choose which parts of her knowledge she reveals with her message by deciding on a message topic f(s); the message data is then determined to be d = ⟨f(s)⟩_(s|I_A). Thus, by choosing f and d Alice can directly determine certain expectation values of Bob's knowledge, which then influence Bob's action.

Optimal Action
The action Bob chooses optimally is

a_B := argmax_{a∈A} u_B(a|I_B), with u_B(a|I_B) := ⟨u_B(s, a)⟩_(s|I_B).

For simplicity we assume that this action is uniquely determined and that u_B(a|I_B) is twice differentiable w.r.t. a. Then we find that a_B solves

0 = ⟨g_B(s, a_B)⟩_(s|I_B), with g_B(s, a) := ∂u_B(s, a)/∂a,   (39)

meaning that for the action a_B Bob chooses, his expectation for his utility gradient g_B(s, a) w.r.t. his action has to vanish. From Alice's perspective, Bob's optimal action would be

a_A := argmax_{a∈A} u_A(a|I_A), with u_A(a|I_A) := ⟨u_A(s, a)⟩_(s|I_A),

implying that a_A solves

0 = ⟨g_A(s, a_A)⟩_(s|I_A), with g_A(s, a) := ∂u_A(s, a)/∂a.   (41)
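These optimality conditions can be illustrated numerically: for an assumed quadratic utility u_B(s, a) = −(a − s)², the expected-utility maximizer coincides with the root of the expected gradient and equals Bob's posterior mean (the belief and grids below are illustrative):

```python
import numpy as np

# Discrete belief P(s|I_B) and a utility u(s, a); Bob maximizes <u(s, a)>_B.
s = np.linspace(-3.0, 3.0, 601)
ds = s[1] - s[0]
p_b = np.exp(-(s - 0.5) ** 2); p_b /= p_b.sum() * ds     # illustrative P(s|I_B)

u = lambda s, a: -(a - s) ** 2                           # assumed quadratic utility
a_grid = np.linspace(-3.0, 3.0, 6001)
eu = np.array([np.sum(u(s, a) * p_b) * ds for a in a_grid])
a_b = a_grid[np.argmax(eu)]                              # a_B = argmax_a u_B(a|I_B)

# At a_B the expected utility gradient g_B(s, a) = du/da = -2(a - s) vanishes:
g = np.sum(-2 * (a_b - s) * p_b) * ds
assert abs(g) < 1e-2                                     # Eq. 39, discretized
assert abs(a_b - np.sum(s * p_b) * ds) < 1e-2            # quadratic u: a_B = <s>_B
```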

Dishonest Communication
Let's first investigate the scenario of Alice being so eager to manipulate Bob's action to her advantage that she is willing to lie for it. In order that Bob does what Alice finds optimal for herself, a_B = a_A, she needs to manipulate Bob's updated knowledge state I_B such that, according to Eq. 39,

⟨g_B(s, a_A)⟩_(s|I_B) = 0.   (45)

Thus, it would be advantageous for her if Bob's expected utility gradient g_B(s, a_A) vanishes for the action a_A, which Alice prefers him to take, since then he would take this action, a_B = a_A. Alice can achieve this by setting his expectation for g_B(s, a_A) to zero via communicating to him an appropriate deceptive message. Any message m = (f, d) with the topic being f(s) = g_B(s, a_A) + c and the data d = c will achieve that Eq. 45 is satisfied. Here c ∈ R^n is an arbitrary constant Alice might use to ensure f ∈ F (if this is possible) or to obscure her manipulation.
As her deceptive communication derives from Alice's knowledge and utility, these quantities imprint onto her message. This happens through the usage of her optimal action a_A in g_B(s, a) in Eq. 45. This action preferred by her is specified by Eq. 41.
Thus, Alice will use a communication topic f that reflects Bob's interest, as it is built on g B (•, a), however evaluated for Alice's preferred action a A in this scenario.In order for her manipulation to work, she has to make Bob believe that ⟨g B (s, a A )⟩ (s|I B ) = 0 and hope that this will let him indeed choose a = a A .
Interestingly, Alice does not need to know Bob's initial knowledge state I 0 for this, as the MEP update ensures that the relevant moment of Bob's updated knowledge I B gets the necessary value, see Eq. 45.Nevertheless, she needs to know his interests g B (s, a), as through exploiting those she can manipulate his action.
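A sketch of this manipulation under an assumed quadratic utility for Bob (so that g_B(s, a) = 2(s − a) and a_B = ⟨s⟩_B): Alice sends the deceptive topic f(s) = g_B(s, a_A) with data d = c = 0, and Bob's MEP update then makes him take exactly Alice's preferred action. All distributions and parameters below are illustrative:

```python
import numpy as np

s = np.linspace(-4.0, 4.0, 2001); ds = s[1] - s[0]
norm = lambda dens: dens / (dens.sum() * ds)
p_bob0 = norm(np.exp(-(s**2) / 2.0))           # Bob's prior P(s|I_0), mean 0

# Bob's utility u_B(s, a) = -(a - s)^2, so g_B(s, a) = 2(s - a) and a_B = <s>_B.
a_alice = 0.7                                  # the action Alice wants Bob to take
f = 2 * (s - a_alice)                          # deceptive topic f(s) = g_B(s, a_A) + c
d = 0.0                                        # reported data d = c = 0 (generally a lie)

def mep_update(p0, f, d, lo=-20.0, hi=20.0):   # MEP decoding: tilt by exp(mu f), bisect mu
    moment = lambda mu: np.sum(f * norm(p0 * np.exp(mu * f))) * ds
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if moment(mid) < d else (lo, mid)
    return norm(p0 * np.exp(0.5 * (lo + hi) * f))

p_bob = mep_update(p_bob0, f, d)
a_bob = np.sum(s * p_bob) * ds                 # Bob's resulting optimal action
assert abs(a_bob - a_alice) < 1e-6             # Bob now does what Alice wanted
```

Note that, as stated above, Alice never needed Bob's prior: the MEP constraint alone forces ⟨g_B(s, a_A)⟩_B = 0.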
In the likely case of d ≠ ⟨f(s)⟩_(s|I_A), Alice would be lying. However, lying is risky for Alice, since Bob might detect her lies in the long run, be it because Bob's knowledge after Alice informed him, P(s|I_B), turns out too often to be a bad predictor for s, or by other telltale signs of Alice. Bob realizing that Alice lies could have negative consequences for her in the long run; therefore we assume in the following that Alice is always honest. However, she might still follow her own interests. 7

Topics under Misaligned Interests
What Alice faces in the general case of differing interests is a complex mathematical problem.Even if Alice is bound to be honest, she still has some influence on Bob by deciding which part of her knowledge she shares.She does not need to give him information that would drive his decision against her own interests.By choosing the conversation topic smartly, Alice could make Bob acting in a way that is beneficial to both of them to some degree.
Let us assume for now that Alice knows both utility functions, Bob's and her own, as well as Bob's initial knowledge P(s|I_0). For a given f ∈ F used as the topic of her honest communication m = (f, ⟨f(s)⟩_(s|I_A)), she can work out Bob's resulting updated knowledge I_B = I_B(f), his action a, as well as how advantageous that action would be for her, by calculating and optimizing her expected utility as a function of the topic choice.

[Footnote 7: In case Bob realizes Alice is lying, he will stop updating his knowledge according to her messages, and she will lose her ability to inform him w.r.t. s, for good or bad. For the complex dynamics that can emerge under not fully honest communication, the reader is referred to [51]. Bob might even partly decode from her message what Alice's interests are, as a_A and g_A(s, a_A) = ∂u_A(s, a_A)/∂a_A imprint onto her message's topic. This can even enable him to choose actions that deviate from Alice's interests as much as he can afford in order to punish her for lying. Thus, although lying can definitely bring a short-term advantage to Alice, in the long run it could cost Alice more than she might gain in the beginning. For this reason, she might decide to become honest or to be honest already in the first place. Although performing punishments typically costs Bob in terms of his own utility, he might choose them in a way that they cost Alice more than himself. This way, they might educate her to become honest, which would then let them be a good investment for Bob, as he will benefit from the information Alice can share.]
The last step here, determining P(s|I_B), is itself an optimization problem according to the MEP. The optimal topic for Alice therefore results from a threefold nested optimization: over the topic f, over Bob's MEP update (i.e., the multiplier µ), and over Bob's action a. Analytic solutions to this can only be expected in special cases. For future reference and numerical approaches to the problem, we calculate the relevant gradient ∂u_A/∂f (Eq. 53) in App. D, where a = a_B(f), P(s*|I_B) = P(s*|I_B(f)), and µ = µ(f) are the action, knowledge state, and Lagrange multiplier implied by the topic f, and δf(s) := f(s) − ⟨f(s)⟩_(s|I_B). Inspection of the condition ∂u_A/∂f = 0 and the terms that could allow for it is instructive, as it hints at the factors that drive Alice's topic choice. Alice has found a locally optimal topic for her communication when this gradient vanishes and the corresponding location is not a minimum of the expected utility. Detailed investigations of the general case of misaligned interests are left for future work. Here, only an illustrative example will be examined.

Example of Misaligned Interests
An instructive example of misaligned interests is in order. For this, let us assume that the space of possible situations as well as that of actions have two dimensions, s, a ∈ R². Alice's and Bob's initial beliefs shall be Gaussian distributions G(s − s_X, S_X) for X ∈ {A, B}, with s_A = (1, 0)^t, s_B = (0, 0)^t, and S_A < S_B = 1 (spectrally), so that Alice is better informed than Bob (since S_A < S_B) and the chosen coordinate system is aligned with that knowledge (the s_1-axis is parallel to s_A).

[Fig. 4 caption: Bob's knowledge is shown by the background color as well as by the blue contour lines at the 1- and 2-sigma levels. Alice's more precise knowledge is indicated only via red 1-, 2-, and 3-sigma level contours. The dots mark possible actions for Bob that are optimal for him under his knowledge (blue), under Alice's knowledge (green), or optimal for Alice (red). Comparing the two panels, especially the movement of Bob's optimal action (blue dot) between them, shows that Alice informs Bob such that he chooses an action that is a compromise between their interests.]

Furthermore, Bob's utility should be such that he wants his action to match the situation, while Alice would prefer if he matched a target rotated by an angle φ according to her utility, establishing a misalignment of their interests. In this situation, Bob's expected utility is maximized by matching his action to the first moment of his belief on s, which therefore fully determines his action; hence Alice only needs to inform him about that moment. Let us therefore assume that Alice will use a topic of the form f(s) = τ^t s, with τ = (cos α, sin α)^t some normalized direction, so that τ^t τ = 1.
We check later using Eq. 53 whether this is her best choice or not. The data of her message is then d = ⟨τ^t s⟩_(s|I_A) = τ^t s_A = cos α, and Bob's updated knowledge state becomes a Gaussian with mean ⟨s⟩_(s|I_B) = τ cos α, as verified in the following: The MEP fixes Bob's updated knowledge to be of the form P(s|I_B) ∝ G(s − s_B, S_B) exp(µ τ^t s). Requiring that the communicated moment is matched fixes µ. Alice's expected utility then becomes maximal for a specific topic direction α, as a straightforward calculation shows. The sign of ± appearing in this solution has to be chosen such that u_A(a) is maximal, which for φ ∈ [−π, π] turns out to be +.
For mostly aligned interests, φ ≪ 1, Alice's optimal topic has an angle of α = ½φ + O(φ³), which means it is a nearly perfect compromise between what is optimal for her and for Bob. For φ = 0 their interests are perfectly aligned and Alice informs Bob ideally with the statement "⟨s_1⟩_(s|I_A) = 1".
In case of orthogonal interests, φ = π/2, Alice's optimal topic angle is α = π/4, effectively informing with a statement like "⟨s_1 + s_2⟩_(s|I_A) = 1", which, given that ⟨s⟩_(s|I_A) = (1, 0)^t, is less informative for Bob than the statement "⟨s_1⟩_(s|I_A) = 1" Alice would have made under aligned interests. Bob's resulting decision of a = (1/2, 1/2)^t in that case turns out to be a perfect compromise between their interests, or, put differently, it is sub-optimal for each of them to the same degree. This situation is depicted in Fig. 4.
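This compromise action can be checked directly under the stated Gaussian assumptions (standard-normal prior for Bob, s_A = (1, 0)^t): the MEP update for the linear topic τ^t s shifts Bob's mean to τ cos α, which for α = π/4 is exactly (1/2, 1/2)^t:

```python
import numpy as np

# Bob's prior: standard Gaussian N(0, 1) in 2-D; Alice's mean s_A = (1, 0)^t.
s_a = np.array([1.0, 0.0])
alpha = np.pi / 4                        # Alice's topic direction for phi = pi/2
tau = np.array([np.cos(alpha), np.sin(alpha)])

d = tau @ s_a                            # honest message data d = <tau^t s>_A = cos(alpha)

# MEP update of N(0, 1) with constraint <tau^t s>_B = d tilts by exp(mu tau^t s),
# giving posterior mean mu * tau; tau^t (mu tau) = mu then forces mu = d:
mean_b = d * tau                         # Bob's updated mean = his optimal action

assert np.allclose(mean_b, [0.5, 0.5])   # the compromise action a = (1/2, 1/2)^t
```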
For anti-aligned interests, φ = π, Alice's optimal topic angle is α = π/2 as then she can only send the uninformative message "⟨s 2 ⟩ (s|I A ) = 0", as revealing any more of her knowledge would be against her interests.This leaves Bob's knowledge unchanged and therefore lets him pick the action a = s B = (0, 0) t .
To summarize, misalignment of interests leads to a communication and a resulting action that are a compromise between the interests of the communication partners.Who of the two has to compromise more depends on details of their knowledge states and their interests.
Informing Bob sub-optimally to her own advantage bears for Alice the risk that Bob realizes this. In repeated situations, Bob might recognize the systematic misalignment between the situations s* that happened and the topic directions τ chosen by Alice on the basis of her information on s. This might let him question either Alice's good intentions toward him or her competence. In either case he could threaten to ignore or even counteract her advice until he becomes convinced that she has largely aligned her interests with his.
The general optimal topic of Alice's message could, however, be a non-linear function of the situation instead of the linear f(s) = τ^t s assumed above. This can be checked by inspecting the functional gradient of u_A(f) w.r.t. f as given by Eq. 53. In case it vanishes for all s ∈ S, the topic was optimal.
It turns out that this gradient only vanishes when Alice's and Bob's interests are aligned (φ = 0), as shown in App. E. For the instructive case of orthogonal interests (φ = π/2), however, the gradient does not vanish. This indicates that she could construct a more sophisticated message that would pull Bob's resulting action a bit closer towards her own interest a_A and further away from the action optimal for him (under her knowledge). The precise form of the optimal topic for Alice is left for future work.

Aligned Interests
In the following we assume that Alice simply wants the best for Bob from his perspective (a_A = a_B) and therefore adopts his utility as her own. In this case, Alice informs Bob optimally via the message m = (f, d) with f(s) = g_A(s, a_A) + c = g_B(s, a_B) + c, which leads to a synchronization of their expectations w.r.t. the most relevant moment of s, ⟨f(s)⟩_(s|I_A) = ⟨f(s)⟩_(s|I_B), and therefore to an alignment of their optimal actions, a_A = a_B =: a*.
For simplicity, we assume in the following that the action is described by one real number a ∈ R. An extension to a vector-valued action space A = R^u is straightforward, but does not add much to the discussion below except notational complexity. Furthermore, we assume Alice's and Bob's common utility function to be uni-modal in a given situation and to be well approximated within the relevant region by Eq. 71. Here, b(s) is the optimal action in a given situation s, v(s) the utility of this optimal action, σ(s) ∈ R⁺ the tolerance for deviations from the optimal action, and k ∈ N specifies how harshly larger deviations reduce the utility. This should serve as a sufficiently generic model that can capture a large variety of realistic situations. In particular, the case of a quadratic loss (= negative utility) with k = 1 mimics the typical situation in which a Taylor expansion in a around the optimal action b(s) can be truncated after the quadratic term. Alice's expected utility of Bob's action has a gradient that is a polynomial of odd order 2k − 1 in a and is therefore guaranteed to have at least one real root. The maximum of u(a|I_A) among all such roots then gives the optimal action a*. Thus, the corresponding topic function is Alice's best choice for a communication that ensures that Bob makes an optimal decision. In case b(s) = s and σ(s) = 1 this is f(s) = (a* − s)^{2k−1}, which is a polynomial of order 2k − 1 in s. Instead of communicating the expectation value of this polynomial, which requires her to work out a*, Alice could simply communicate all moments up to order 2k − 1 and thereby ensure that Bob has all information needed to decide on the optimal action.
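A minimal sketch of this decision rule, assuming the generic loss form described above with b(s) = s and σ(s) = 1, so that the expected utility gradient is proportional to −⟨(a − s)^{2k−1}⟩. The root is found here by bisection over Monte Carlo samples; the helper name and parameter values are illustrative.

```python
import random

def optimal_action(samples, k=1):
    """Root of the expected utility gradient -<(a - s)^(2k-1)> for the
    generic utility with b(s) = s, sigma(s) = 1, found by bisection."""
    def grad(a):
        return -sum((a - s) ** (2 * k - 1) for s in samples) / len(samples)
    lo, hi = min(samples), max(samples)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0:      # expected utility still rising: move right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(0)
s = [random.gauss(2.0, 1.0) for _ in range(5000)]
# For k = 1 (quadratic loss) the optimal action is the mean <s>:
print(optimal_action(s, k=1), sum(s) / len(s))
```

For k = 1 the two printed numbers agree, illustrating that a quadratic loss makes the first moment the only quantity Alice needs to communicate.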
Here, the requirement of properness appears in a weak form: Alice wants to inform Bob about a number of moments of her knowledge in order to put him into a position to make a good decision. Full properness would mean that Alice wants Bob to know all possible moments of her knowledge. Thus, properness is expected to occur when Alice does not know Bob's utility function, but wants to support him no matter what his interests are. We now turn to such a scenario.

Attention
We saw why Alice might align her interest with Bob's, and in the following we assume this to have happened, u_A = u_B ≡ u. Her knowledge of Bob's utility function influences how she optimally selects her message. For the concept of attention to appear in her reasoning, Alice must not know Bob's utility function in detail; if she did, she would simply optimize her message for that utility function. However, she needs to be aware of the sensitivity with which Bob's utility reacts to his choices in the different situations s in order to give those situations appropriate weights in her communication. These weights determine how accurately she should inform about the different situations such that Bob is optimally prepared to make the right decision.
To be concrete, let Alice assume that in a given situation s Bob's utility has a single maximum at an optimal action a*(s) = b(s) unknown to her, with a height u(s, a*(s)) = v(s) also unknown to her, but with a curvature w(s) known to her. Furthermore, she assumes that this utility function can be well Taylor-approximated around any of these maxima, corresponding to the case of Eq. 71 with k = 1 and σ(s) = [w(s)]^{−1/2}. We have added the parameters v, w, and b to the list of arguments of this approximate utility function, as Alice needs to average over the ones unknown to her, which are v and b.
In order to circumvent the technical difficulty of dealing with probabilities over a function space, let us restrict the following discussion to discrete s ∈ S. In this case the parameters v, w, and b become finite-dimensional vectors with components v_s = v(s), etc. The case of a continuous set of situations is dealt with in App. F.
Alice assumes that Bob's action will depend on these parameters and is given by a_B = a : [⟨∂u(s, a, v, w, b)/∂a⟩ = 0], where a : [f(a) = 0] denotes the (here assumed to be unique) value of a that fulfills f(a) = 0. Furthermore, we introduced the w-weighted expectation value, which involves the attention A^(w)(s|I_B) defined in Eq. 3.
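The attention function of Eq. 3 is simple to state in code for discrete s. The probabilities and curvature weights below are purely hypothetical toy values.

```python
def attention(P, w):
    """A^(w)(s|I) = w(s) P(s|I) / sum_s' w(s') P(s'|I)   (Eq. 3, discrete s)."""
    Z = sum(w[s] * P[s] for s in P)
    return {s: w[s] * P[s] / Z for s in P}

# Hypothetical belief and curvature weights:
P = {"bad": 0.2, "alright": 0.5, "good": 0.3}
w = {"bad": 4.0, "alright": 1.0, "good": 0.0}   # w = 0: actions cannot matter there
A = attention(P, w)
print(A)
# Only relative weights matter: rescaling w leaves the attention unchanged.
print(max(abs(attention(P, {s: 10.0 * w[s] for s in w})[k] - A[k]) for k in P))  # ~0
```

The zero weight on "good" removes that situation from Bob's attention entirely, and the rescaling check illustrates the later remark that only the relative values of w(s) matter.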
Let us assume that Alice believes the curvatures of Bob's utility maxima to be w*(s). This might be because she can estimate the influence Bob's actions have on his own well-being in the different situations. For example, in the extreme case that Bob might be dead in a given situation s*, she might set w*(s*) = 0, as none of Bob's possible actions then matter to him any more. It will turn out that the absolute values of w(s) do not matter, only their values relative to each other.
Thus, her knowledge about Bob's utility is determined by her guess w* together with P(v, b|I_A), a relatively uninformative probability density. We assume this to be independent of her knowledge of the situation, P(s, v, b|I_A) = P(s|I_A) P(v, b|I_A).
Furthermore, in order to have a simple instructive scenario, we assume here that Alice only has a vague idea around which value b̄ := ⟨b(s)⟩_(b,s|I_A) the location of the maximum b(s) of Bob's utility could be, and how much it could deviate from b̄. We assume that she is not aware of any correlation of this function nor of any structure in its variance, and therefore define the covariance D(s, s′) := ⟨(b_s − b̄)(b_{s′} − b̄)⟩_(b|I_A) = γ δ_{ss′}. In the last step we used that Alice's knowledge on b is uninformative, therefore unstructured, and thus its uncertainty covariance is proportional to the unit matrix. The parameter γ expresses how much variance Alice expects in b; its precise value will turn out to be irrelevant. Other setups, in which b(s) is not a constant or D(s, s′) contains cross-correlations, are addressed in App. F.
With the above assumptions, the expected utility can be decomposed into three terms I-III, where we write just A for A^(w*) for brevity. Inserting I-III into Eq. 83 gives an expected utility that needs to be maximized w.r.t. I_B, Bob's knowledge after the update. It is obvious that the maximum is at I_B = I_A, as then A(s|I_B) = A(s|I_A) and the negative term becomes zero.
At this maximum we have P(s|I_B) = P(s|I_A) if w*(s) > 0 for all s ∈ S, which means that Alice strives to communicate properly, if possible. Otherwise she tries to minimize the L²-norm between her and Bob's attention distribution functions.
In case S is continuous, properness also emerges as Alice's optimal strategy if G(s, s′) ∝ δ(s − s′), as is shown in App. F. If these conditions are not met, Alice will optimally transmit a biased attention to Bob.

Derivation
We have seen how attention appears naturally in a communication scenario in which an honest sender tries to be supportive of the receiver without knowing details of the receiver's utility function, except for a guess of the variation of its narrowness across situations. The measure used by such a sender to match the receiver's attention function to her own is then typically of a square-distance form, as in Eq. 89, possibly with some bias term as in Eq. 159. In any case, attention seems to be a central element of utility-aware communication.
This poses the question whether there is a scenario in which relative attention entropy appears as the measure the sender should use to choose among possible messages.The answer to this question is yes.
In case Alice assumes that Bob will judge her prediction on the basis of how much attention was given to the situation that ultimately happened, knows the weights Bob will apply to turn probabilities into attentions, and wants to be proper if possible, the relative attention entropy can be derived analogously to the derivation of relative entropy in Sec. 2.1, as we see in the following.
Again, we require the measure to be analytical, proper, and calibrated, and only modify the requirement of locality to attention locality: only Bob's attention A(s*|I_B) for the case s = s* that happens in the end matters for Alice's loss.
Again, at the time Alice has to make her choice, she does not know which situation s = s* will happen in the end, and therefore needs to minimize her expected loss when deciding which knowledge state I_B Bob should get (via her message m). This loss will depend on the weight function w(s) that turns probabilities into attentions according to Eq. 3. We assume w(s) > 0 for all s ∈ S in the following, such that A(s|I_A) = A(s|I_B) for all s ∈ S implies P(s|I_A) = P(s|I_B) for all s ∈ S and vice versa.
Attention locality implies that L^(w)(s, I_A, I_B) must depend on I_B through q(s) := A(s|I_B), whereas the dependence on I_A could still be through P(s|I_A). However, as the information contents of P(s|I_A) and A(s|I_A) are equivalent, it is convenient to express the dependence on I_A through p(s) := A(s|I_A), as then properness is given when q = p.
Thus we have an expected loss in which, as before, we use the function signatures to discriminate the different L^(w)'s, and into which we introduce a Lagrange multiplier to ensure that q(s) is normalized.
Properness then requires that the expected loss be minimal for q = p, implying a condition for all possible s = s*. From this follows a functional equation, which is solved analytically, as can be verified by insertion. We note that ∫_S ds p(s) w(s) can be evaluated explicitly, and choose λ = 1. Calibration then requires that 0 = L^(w)(s, x, x). Thus, Alice's loss function for choosing the message turns out to be the relative attention entropy. This closes its derivation.
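For discrete S (and weights w(s) > 0, as assumed above), the derived measure reads D^(w)_A(I_A, I_B) = Σ_s A(s|I_A) ln[A(s|I_A)/A(s|I_B)]. A small sketch with made-up numbers illustrates that it is proper, i.e. minimal and zero at I_B = I_A, and that it reduces to ordinary relative entropy for constant weights:

```python
import math

def attention(P, w):
    Z = sum(w[s] * P[s] for s in P)
    return {s: w[s] * P[s] / Z for s in P}

def rel_attention_entropy(P_A, P_B, w):
    """D^(w)_A(I_A, I_B) = sum_s A(s|I_A) ln[A(s|I_A) / A(s|I_B)]."""
    A_A, A_B = attention(P_A, w), attention(P_B, w)
    return sum(A_A[s] * math.log(A_A[s] / A_B[s]) for s in P_A)

# Hypothetical beliefs and weights:
P_A = {0: 0.6, 1: 0.3, 2: 0.1}
P_B = {0: 0.3, 1: 0.4, 2: 0.3}
w   = {0: 1.0, 1: 2.0, 2: 4.0}

print(rel_attention_entropy(P_A, P_B, w))   # positive for differing beliefs
print(rel_attention_entropy(P_A, P_A, w))   # 0.0: minimal at I_B = I_A (proper)
# Constant weights recover ordinary relative entropy:
kl = sum(P_A[s] * math.log(P_A[s] / P_B[s]) for s in P_A)
print(abs(rel_attention_entropy(P_A, P_B, {s: 3.0 for s in P_A}) - kl))   # ~0
```

The last line also anticipates the observation of the next section that constant weights reduce relative attention entropy to relative entropy.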

Comparison to Other Scoring Rules
A brief comparison of relative attention entropy to other attention-based score functions is in order. First we note that in case the weights are constant, w(s) = const, relative attention entropy reduces to relative entropy.
For the comparison to the communication scenario of Sec. 4, in which Alice wants to support Bob as much as possible but only knows the curvature of Bob's utility, we investigated the limit of small relative differences between the attention functions, ∆(s) := (A(s|I_B) − A(s|I_A))/A(s|I_A) ≪ 1 for all s ∈ S. In this case, relative attention entropy is well approximated by (1/2) ∫_S ds A(s|I_A) ∆²(s), which is the well-known information metric. Comparing this to the negative loss function of Eq. 89, as generalized to the continuum by Eq. 159, gives, under the assumptions of homogeneity and independence (G(s, s′) ∝ δ(s − s′), see App. F for details) and up to positive factors, the score −∫_S ds (A(s|I_B) − A(s|I_A))². There is at least one similarity between these scores, in that deviations between the attention functions should be avoided, as the loss increases with their square. However, these scores also differ in a significant point: for the attention entropy the squared deviation in the attention functions is inversely weighted with Alice's attention. This means that relative attention entropy allows for larger deviations in regions of higher attention compared to the utility-based score, and demands smaller deviations in regions of low attention. Finally, we note that weighted relative entropy as well as relative attention entropy are equivalent to scoring rules [25, 26]. Scoring rules evaluate how well a belief I_B matches a probability P(s|I_A) and are, in our notation, of the functional form ∫_S ds P(s|I_A) L(s, I_B), with L being some loss function that expresses how bad it is to believe only with the strength P(s|I_B) in an event s ∈ S that might happen with the correct probability P(s|I_A).
Scoring rules are used to choose the "best fitting" belief among a set of beliefs by picking the one that has the lowest score. They are called proper if the best fit for I_B is I_A whenever the latter is part of the set of beliefs to choose I_B from. Additive and I_B-independent affine transformations of a score do not change its minimum and therefore lead to identical results for I_B; such scores can thus be brought into the extended form S′(I_A, I_B) = a(I_A) ∫_S ds P(s|I_A) L(s, I_B) + b(I_A). For our claim of equivalence we therefore only need to show that weighted relative entropy as well as relative attention entropy can be brought into this form, with suitable choices of a, b, and L. Thus, the well-developed formalism of scoring rules [26] can be used to investigate these entropies. It might be interesting to note in this context that weighted entropy is equivalent (though not identical, due to an irrelevant additive term) to a local scoring rule, since its L(s, I_B) depends only on P(s|I_B) for the s in the argument of L. Attention entropy, however, is equivalent to a non-local score, as the normalization of the attention function in its L combines values of P(s|I_B) for different s.

Properness, Attention, and Entropy
Entropy is a central element of communication theory. Relative entropy allows a sender to decide which message to send in order to inform about an unknown situation in case only the communicated probability of the situation that finally happens matters. Naively introducing an importance weighting for the different situations into relative entropy renders the resulting weighted relative entropy improper, meaning that it does not favor transmitting the sender's precise knowledge state in case this is possible.
In order to find guidance on how a weighting could be properly introduced into entropic communication, we investigated the scenario in which a sender, Alice, informs a receiver, Bob, about a situation that will matter for a decision on an action Bob will perform. The goal of this exercise is to find a scenario that encourages Alice on the one hand to be proper, and on the other hand to include weights into her considerations. Alice can decide which aspects of her knowledge she communicates and which she omits. In case the utility functions of Alice and Bob differ, Alice might be tempted to lie to Bob. This would certainly be improper. We argued that lying should be strongly discouraged if Alice and Bob interact repeatedly, as otherwise Bob might discover that Alice lies and stop cooperating with her, or even punish her by taking actions that impact her utility negatively. Already the existence of this option for Bob could give Alice a sufficient incentive towards honesty.
But even if Alice is bound to be honest, she can still choose which parts of her knowledge are revealed to Bob and which she prefers to keep to herself by communicating diplomatically. In order to be able to influence Bob's action to her advantage, Alice has to give him some information useful to him, but only in a way that this information also serves her interests. This way, both expect to benefit from the communication, which is honest, but not proper.
Again, in a repeated interaction scenario, Bob has a chance to discover that Alice is not fully supportive of his needs by judging how helpful Alice's communications were and whether there are systematic omissions of relevant information. For example, in the scenario discussed in Sect. 3.5, in which Alice's interests are always rotated by 90° to the left of Bob's, he might realize that her advice makes him choose actions that are typically rotated 45° to the left of what would have been optimal for him. Under the plausible assumption that her knowledge is generated independently of his utility, a few such incidents should make him suspicious about whether Alice is really providing him with all of her information relevant to him. Thus, Alice also risks getting a bad reputation by not being fully supportive of Bob.
Assuming then that Alice aligns her interests with Bob's, we still do not find that Alice is forced to be proper, as she only needs to inform him about the aspects of her knowledge that are relevant for his action.
In order to recover properness in this communication scenario, we needed to assume that Alice is fully supportive of Bob, but does not know his utility function in detail. Now she has to inform him properly, to prepare him for whatever his utility is. Furthermore, if she knows how sharply his utility function is peaked in the different situations, she should fold this sharpness as a weight into the measure she uses to choose how to communicate. More precisely, Alice should turn her knowledge state into an attention function, basically a weighted probability distribution that is again normalized to one. And then she should communicate such that Bob's similarly constructed attention function becomes as close as possible to hers. In the discussed scenario, the square difference of the attention functions should be minimized. This quadratic loss function for attentions has a well-known equivalent for probabilities, the Brier score [52]. For this an axiomatic characterization exists [53], which requires properness as one of the axioms (there called "incentive compatibility"). Here, we found a communication scenario in which properness emerges from the requirement that the communication should be useful for the receiver, without having that use specified.
This last scenario therefore provides a communication measure that is proper and weighted.It is, however, a quadratic loss and therefore of a different form than an entropy based on a logarithm.Nevertheless, it shows the path on how to construct such a weighted entropy that leads to properness.
In order to obtain a proper and weighted entropic measure, we have to require that Alice's communication is judged by Bob on the basis of the attention value she gave to the situation that finally happened. This, together with the requirement of properness, then determines relative attention entropy as the unique measure for Alice to choose her message.
It should be noted that attention is here formed by giving weights to different possible situations. In machine learning, the term attention is prominent in the form of weights on different parts of a data vector or latent space [43-45]. These two concepts of attention are not unrelated, as giving weight to specific parts of the data implies up-weighting the possibilities to which these parts of the data point.
Our purely information-theoretically motivated considerations should have technical as well as socio-psychological implications, as we discuss in the following.

Technical Perspective
The concept of attention and its relative entropy should have a number of technical applications.
In the design of communication systems, the relevance of the different situations about which the communication should inform might differ. Attention and its relative entropy guide how to incorporate this into the system design. More specifically, in the problem of Bayesian data compression one tries to find compressed data that imply an approximate posterior as similar as possible to the original one, as measured by their relative entropy [54]. However, there can be cases in which the relative attention entropy is a better choice, as it permits importance weighting of the potential situations.
Bayesian updating from a prior P(s|I) to a posterior P(s|d, I) = P(d|s, I) P(s|I) / P(d|I) = P(d|s, I) P(s|I) / ∫ ds P(d|s, I) P(s|I) (108) is of the form of forming an attention function out of the prior distribution, with the weights given by the likelihood P(d|s, I). Communicating a prior in the light of data one might already have gotten is then also best done using the corresponding relative attention entropy.
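This correspondence is easy to verify numerically: for discrete s, the Bayes posterior of Eq. 108 coincides exactly with the attention function built from the prior using the likelihood as weight function. All numbers below are hypothetical.

```python
def attention(P, w):
    """A^(w)(s|I) = w(s) P(s|I) / sum_s' w(s') P(s'|I)."""
    Z = sum(w[s] * P[s] for s in P)
    return {s: w[s] * P[s] / Z for s in P}

# Hypothetical prior and likelihood P(d|s, I) for some observed data d:
prior      = {"rain": 0.3, "clouds": 0.5, "sun": 0.2}
likelihood = {"rain": 0.8, "clouds": 0.4, "sun": 0.1}

# Bayes' theorem (Eq. 108): likelihood-weighted, re-normalized prior ...
evidence  = sum(likelihood[s] * prior[s] for s in prior)
posterior = {s: likelihood[s] * prior[s] / evidence for s in prior}

# ... which is exactly the attention function with w = likelihood:
diff = max(abs(posterior[s] - attention(prior, likelihood)[s]) for s in prior)
print(posterior, diff)   # diff = 0.0
```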
Furthermore, we like to stress that attention functions as defined here are formally equivalent to probabilities and can therefore, formally, be inserted into any formula that takes those as arguments. In particular, all scoring rules for probabilities [8] can be extended to attentions, and attention thereby provides a means to introduce the concept of relevance into those.
Finally, we like to point out that ensuring that more relevant dimensions of a signal or situation s ∈ R^n are more reliably communicated can be achieved by constructions like D^(c,w)_s(I_A, I_B) := D^(w)_s(I_A, I_B) + Σ_{i=1}^n c_i D_{s_i}(I_A, I_B) (109), in which the additional relative entropies for the individual signal directions are weighted according to c = (c_i)_{i=1}^n ∈ (R₀⁺)^n. The term D^(w)_s(I_A, I_B) ensures propriety of the resulting scoring rule for any such c.

Socio-psychological Perspective
Attention, intention, and properness are concepts that play a significant role in cognition, psychology, and sociology [45, 55-59]. This work made it clear that utility-aware communication naturally involves the concept of attention functions, which guide the choice of topics to the more important aspects of the speaker's knowledge that are to be communicated. There could, for example, be certain situations in which the different options for action a message receiver has do not matter much, so that detailed knowledge of these situations is not of great value to him. The sender of messages should not spend much of her valuable communication bandwidth on informing about such situations of low empowerment for the receiver.
In our derivation of properness and attention we investigated scenarios in which the interests of speaker and receiver deviate. This is a very common situation in sociology. We saw that misaligned interests can leave an imprint on the topic choice of otherwise honest communication partners. Based on our calculations in Sec. 3.4, we expect that the usefulness of received information decreases the more the interest of the sender differs from that of the receiver. If the interests are exactly oppositely directed, the sender prefers to send no information at all. Otherwise, the optimally transmitted information results in a compromise between the sender's and the receiver's interests.
The fact that misalignment of interests in general reduces the information content of messages in a society of mostly honest actors provides the possibility to detect and measure the level of such misalignment. Furthermore, our analysis shows that the specific topic choices made by communication partners should allow conclusions to be drawn about their intentions, and about their beliefs regarding the intentions of the receivers of their messages.

A Attention Example Calculations
Here, we give details of the calculations for Sect. 1.3. Before we calculate the attention functions, we note that weighting a Gaussian with an exponential weight function w(s) = exp(λs) shifts and re-scales it: G(s − m, σ²) exp(λs) = G(s − m − λσ², σ²) exp(λm + λ²σ²/2). With this, we see that Alice's attention becomes a correspondingly shifted Gaussian, with the normalization such that it integrates to one, as we claimed in Eq. 10.
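The shift property can be checked numerically. The sketch below re-normalizes an exponentially weighted Gaussian on a grid and compares it to a Gaussian with the mean shifted by λσ²; the parameter values are arbitrary.

```python
import math

def gauss(s, m, var):
    """Normal density G(s - m, var)."""
    return math.exp(-(s - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

m, var, lam = 0.5, 2.0, 1.5            # arbitrary example values
h = 0.01
grid = [i * h - 20.0 for i in range(4001)]

# Weight G(s - m, var) with w(s) = exp(lam * s) and re-normalize numerically:
raw = [gauss(s, m, var) * math.exp(lam * s) for s in grid]
Z = sum(raw) * h
A = [r / Z for r in raw]

# Claim: the attention is again Gaussian, with mean shifted to m + lam * var.
shifted = [gauss(s, m + lam * var, var) for s in grid]
err = max(abs(a - g) for a, g in zip(A, shifted))
print(err)   # tiny discretization error
```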
An analogous calculation gives Bob's attention function, as claimed by Eq. 11. The relative entropy of these (up to terms that do not depend on I_B = (m, σ²_B) and are dropped whenever this happens) involves the attention averaging ⟨f(s)⟩_(s|I_A) := ∫_S ds A^(w)(s|I_A) f(s). From this it becomes apparent that A^(w)(s|I_B) inherits the first and second moments from A^(w)(s|I_A) during the minimization of the relative attention entropy w.r.t. I_B, as claimed in Eq. 13. From this it follows that the equation claimed in Eq. 12 for the mean of Bob's final knowledge holds.
Finally, the mean m and uncertainty dispersion σ²_B of Bob's knowledge state in case Alice uses the weighted relative entropy of Eq. 2 for designing her message to Bob need to be worked out. Minimizing this entropy, up to irrelevant constant terms, w.r.t. m yields the value claimed in Eq. 14. Inserting this into the weighted relative entropy and minimizing w.r.t. σ_B yields the value claimed in Eq. 15. This completes the calculations for Sect. 1.3.
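The improperness of weighted relative entropy can be illustrated numerically in this Gaussian setup: minimizing the m-dependent part of D^(w) over the mean m of Bob's Gaussian does not return Alice's mean m_A, but a weight-shifted value (here λσ²_A, consistent with the shift of the attention peak noted above). All parameter values are arbitrary.

```python
import math

m_A, var_A, lam = 0.0, 1.0, 1.0        # arbitrary example values
h = 0.01
grid = [i * h - 15.0 for i in range(3001)]
# w(s) P(s|I_A) with w(s) = exp(lam s) and P(s|I_A) = G(s - m_A, var_A):
wp = [math.exp(lam * s) * math.exp(-(s - m_A) ** 2 / (2 * var_A)) for s in grid]

def loss(m):
    """m-dependent part of the weighted relative entropy
    D^(w) = integral w(s) P(s|I_A) ln[P(s|I_A) / G(s - m, var_A)] ds."""
    return sum(w * (s - m) ** 2 for w, s in zip(wp, grid)) * h / (2 * var_A)

candidates = [0.1 * i - 3.0 for i in range(61)]
best_m = min(candidates, key=loss)
print(best_m)   # ~ m_A + lam * var_A = 1.0, not m_A = 0.0: improper
```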

B Real World Communication
We want to illustrate with an example how Alice's moment-constraining messages of the form of Eq. 17 can embrace ordinary, real-world communications. A general proof that the communication of moments is sufficiently rich to express any message is beyond the scope of this work.
To have an illustrative example, we look at the statement m = "Tomorrow's weather should be alright". The relevant but unknown situation s is tomorrow's weather, which we assume for the sake of the argument to be out of S = {bad, alright, good} ≡ {−1, 0, 1}, the latter being a numerical embedding of these situations. It is reasonable then to assume that the statement contains the message "⟨s⟩_(s|I_A) = 0", i.e. the first components of (f, d) are f_1(s) = s and d_1 = 0. Furthermore, the word "should" ∈ {"is going to", "should", "might"} ≡ {0, 1/4, 1/2} can be read as a quantifier for the sender's uncertainty about the situation, which shall here be interpreted as a statement on the variance, "⟨(s − d_1)²⟩_(s|I_A) = 1/4", implying f_2(s) = (s − d_1)² = s² and d_2 = 1/4. Thus the message is given by m = (f(s), d) = ((s, s²)^t, (0, 1/4)^t).
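This embedding can be made concrete in a few lines of code. The probability values below are one hypothetical belief consistent with the message; they are not uniquely implied by it.

```python
# Numerical embedding of m = "Tomorrow's weather should be alright":
S = {"bad": -1, "alright": 0, "good": 1}
d = (0.0, 0.25)          # message data: mean 0, variance 1/4

# One hypothetical belief P(s|I_A) consistent with this message:
P = {"bad": 0.125, "alright": 0.75, "good": 0.125}
mean = sum(P[k] * S[k] for k in S)          # topic f1(s) = s
var  = sum(P[k] * S[k] ** 2 for k in S)     # topic f2(s) = s^2  (since d1 = 0)
print(mean == d[0], var == d[1])   # True True: message constraints fulfilled
```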
Of course, the language embedding chosen here (meaning a representation of a language in a mathematical structure, here the representation of statements about the weather in terms of a topic function f and message data d) is only one possibility out of many. The language embeddings used by the speaker and the recipient of a message need to be identical for high-fidelity communication. In reality, the embedding will depend on social conventions that can differ between speaker and recipient. This might in part explain the difficulty of communication across cultures, even if a common language is used.

C Accurate Communication
Here, we show that the message format of Eq. 17 permits Alice in principle to transfer her exact knowledge to Bob if there are no bandwidth restrictions. In case she knows his knowledge state P(s|I_0), she can simply send the relative surprise function f(s) = −ln(P(s|I_A)/P(s|I_0)) as well as d = ⟨f(s)⟩_(s|I_A), whose magnitude D_s(I_A, I_0) is the amount of information she is transmitting. Bob then updates to P(s|I_B) = P(s|I_A), as a straightforward calculation shows. In case she does not know his initial belief, she could alternatively send her knowledge by using the vector-valued topic f(s) = (δ(s − s′))_{s′∈S}. This lets the message data d = ∫ ds P(s|I_A) (δ(s − s′))_{s′∈S} = (P(s′|I_A))_{s′∈S} be a vector that contains her full probability function, to which Bob would then update his knowledge.
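A quick numerical check of the first claim, for a discrete toy example with hypothetical probabilities: updating Bob's belief with the relative surprise topic via an MEP-type exponential reweighting reproduces Alice's knowledge exactly (with the sign convention of f below, the multiplier comes out as μ = −1).

```python
import math

P_0 = {0: 0.5, 1: 0.3, 2: 0.2}   # Bob's initial belief (hypothetical)
P_A = {0: 0.2, 1: 0.5, 2: 0.3}   # Alice's knowledge (hypothetical)

# Relative surprise topic f(s) = -ln(P(s|I_A) / P(s|I_0)):
f = {s: -math.log(P_A[s] / P_0[s]) for s in P_0}

# MEP-type update P(s|I_B) proportional to P(s|I_0) exp(mu f(s)); matching the
# communicated moment yields mu = -1 here, reproducing Alice's knowledge:
mu = -1.0
Z = sum(P_0[s] * math.exp(mu * f[s]) for s in P_0)
P_B = {s: P_0[s] * math.exp(mu * f[s]) / Z for s in P_0}
print(max(abs(P_B[s] - P_A[s]) for s in P_0))   # ~0 (rounding only)
```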

D Topic Gradient
Here we work out the gradient of Alice's utility w.r.t. the topic f of her honest communication, with a = a_B and μ according to Eqs. 47 and 51, respectively.

E Specific Topic Gradient
Here, the topic gradient given by Eq. 53 at f(s) = τ^t s is calculated for the simple example of misaligned interests of Alice and Bob discussed in Sec. 3.4.
In order to have a concise notation, let us first note that the action points along the topic direction, a = μτ.
With this, the building blocks of the gradient given by Eq. 53 can be evaluated.

Figure 2: Attention functions corresponding to the cases λ = 0, 1, 2, and 4 of Fig. 1 on a logarithmic scale, to display the unattended peak of Alice's attention. Note that due to the strong exponential focus of the weights on larger s-values, the attention peaks are displaced to the right w.r.t. the corresponding knowledge peaks.

Figure 3: Sketch of the investigated communication scenario. Alice communicates parts of her knowledge about an unknown situation to Bob. After updating according to Alice's message, Bob chooses his action, which for simplicity is here assumed to be a point in the situation space S (for example, the blue point could indicate the situation to which his action is best adapted). His action and the unknown situation determine Bob's resulting utility. Bob chooses his action by maximizing his expected utility given his knowledge after Alice informed him (blue equal-probability contours of P(s|I_B) in his mental copy of the situation space). The action and situation also determine a utility for Alice, which may or may not equal Bob's utility. Alice chooses her message such that her expected utility resulting from Bob's action is maximized in the light of her situation knowledge P(s|I_A) (red contours). In case she is honest, she can only choose which parts of her knowledge she reveals with her message by deciding on a message topic f(s); the message data is then determined to be d = ⟨f(s)⟩_(s|I_A).
The topic gradient of Eq. 53 vanishes if
• Alice's remaining interest g_A(a|I_A) is perfectly satisfied by Bob's resulting action a,
• for any situation s* a sophisticated balance b(s*) := {terms in curly brackets of Eq. 53} = 0 holds between Bob's interest g_B(s*, a) in that situation and the difference in the probabilities Alice and Bob assign to it, P(s*|I_B) − P(s*|I_A), or
• the unbalanced term b(s*) is orthogonal to Alice's remaining interest g_A(a|I_A) w.r.t. a metric given by the inverse Hessian of Bob's expected utility u_B(a|I_B) (as derived w.r.t. his action a).

Figure 4: Knowledge states and preferred actions of Alice and Bob in case of misaligned interests, before (left) and after (right) the communication. The plane of s-values is shown. Bob's knowledge state initially, P(s|I_0) (left), and finally, P(s|I_B) (right), is shown by the background color as well as by the blue contour lines at the 1- and 2-sigma levels. Alice's more precise knowledge is indicated only via red 1-, 2-, and 3-sigma level contours. The dots mark possible actions for Bob that are optimal for him under his knowledge (blue), under Alice's knowledge (green), or optimal for Alice (red). Comparing the two panels, especially the movement of Bob's optimal action (blue dot) between them, shows that Alice informs Bob such that he chooses an action that is a compromise between their interests.

The gradient du_A(f)/df(s*) is the product of Alice's expectation for her utility gradient ⟨g_A(s, a_B)⟩_(s|I_A) given Bob's action a_B and the change of his action with changing topics of her communication. This gradient vanishes when Bob happens to choose the action optimal for Alice, such that ⟨g_A(s, a_B)⟩_(s|I_A) = 0 and therefore ⟨g_A(s, a_B)⟩_(s|I_A) = ⟨g_B(s, a_B)⟩_(s|I_B), as the latter is zero thanks to Bob's choice of action (see Eq. 39), or when a further change in f(s*) does not change Bob's action a_B any more. Bob's chosen action is the result of the minimization in Eq. 47. Its gradient w.r.t. f can be worked out using the implicit function theorem: da_B/df(s*) = −[∂²u_B(a_B|I_B)/(∂a_B ∂a_B^t)]^{−1} ∂/∂f(s*) [du_B(a_B|I_B)/da_B].

The required derivative of Bob's expected utility gradient w.r.t. the topic is d⟨g_B(s, a)⟩_(s|I_B)/df(s*) = ∫ ds g_B(s, a) dP(s|I_B)/df(s*) = ⟨g_B(s, a) d[μ^t f(s) − ln Z(μ, f)]/df(s*)⟩_(s|I_B) = g_B(s*, a) P(s*|I_B) μ^t + ⟨g_B(s, a) f(s)^t⟩_(s|I_B) dμ/df(s*) − ⟨g_B(s, a)⟩_(s|I_B) d ln Z(μ, f)/df(s*), where the last term vanishes since ⟨g_B(s, a_B)⟩_(s|I_B) = 0. This gives d⟨g_B(s, a)⟩_(s|I_B)/df(s*) = g_B(s*, a) P(s*|I_B) μ^t − ⟨g_B(s, a) δf(s)^t⟩_(s|I_B) ⟨δf(s) δf(s)^t⟩^{−1}_(s|I_B) [P(s*|I_B) − P(s*|I_A)], with δf(s) := f(s) − ⟨f(s)⟩_(s|I_B), since according to the inverse function theorem applied to the quantity determining μ, namely 0 = g(μ, f) := ∂ ln Z(μ, f)/∂μ − ⟨f(s)⟩_(s|I_A), one has dμ/df(s*) = −⟨δf(s) δf(s)^t⟩^{−1}_(s|I_B) ∂/∂f(s*) [⟨f(s)⟩_(s|I_B) − ⟨f(s)⟩_(s|I_A)].

For the example of Sect. 3.4, the message m = (τ^t s, d) consists of the two essential elements of Alice's communication. However, being honest, Alice is not fully free in what she can say: with the choice of the topic direction τ, the message's data d = τ^t s_A is fully determined by her honesty. This implies a_B = d τ = τ τ^t s_A. Thus, in this situation Bob does exactly what Alice tells him.