Stochastic approximation for uncapacitated assortment optimization under the multinomial logit model

We consider dynamic assortment optimization with incomplete information under the uncapacitated multinomial logit choice model. We propose an anytime stochastic approximation policy and prove that the regret—the cumulative expected revenue loss caused by offering suboptimal assortments—after T time periods is bounded by √T times a constant that is independent of the number of products. In addition, we prove a matching lower bound on the regret for any policy that is valid for arbitrary model parameters—slightly generalizing a recent regret lower bound derived for specific revenue parameters. Numerical illustrations suggest that our policy outperforms alternatives by a significant margin when T and the number of products N are not too small.


INTRODUCTION
Assortment optimization is vital for maximizing revenue. A seller of a large number of substitute products faces the challenge of determining the most profitable subset, that is, assortment, of products to offer to consumers. In most practical situations, the seller does not know the specific demand distribution for all assortments, so that the assortment optimization problem has to be studied in a sequential optimization framework with incomplete information. As information infrastructure improves to better handle incoming real-time purchase data, the need for computationally efficient and easily implementable data-driven algorithms arises.
Because the number of feasible assortments grows exponentially in the number of products, the problem is often studied under a particular choice model that describes how demand or choice probabilities depend on the offered assortment. Perhaps the most widely used choice model is the multinomial logit (MNL) choice model. When the seller can offer assortments of any size, the optimal assortment under the MNL model is of the form "offer the k most profitable products"; see, for example, Talluri and van Ryzin (2004, proposition 6). This structure greatly simplifies the problem, as the seller merely has to learn the optimal value of the one-dimensional quantity k instead of high-dimensional model parameters, suggesting that a conceptually simple stochastic-approximation type of policy might work well. In this article, we construct such a stochastic-approximation policy for the dynamic assortment optimization problem without capacity constraint and with demand characterized by the MNL model with unknown parameters, and show that it is both asymptotically optimal and has excellent numerical performance.
Our work connects to a stream of recent literature on dynamic assortment optimization (Agrawal et al., 2017; Cheung & Simchi-Levi, 2017; Chen & Wang, 2018; Ou et al., 2018; Agrawal et al., 2019; Kallus & Udell, 2020; Chen et al., 2021). All these papers study assortment optimization under the MNL model in a sequential decision framework. An important recent contribution closely related to our work is Chen et al. (2021). The authors construct a Trisection policy that exploits the structure of the optimal assortment in the uncapacitated MNL assortment optimization problem ("offer the k most profitable products"), and prove that its regret—the total expected revenue loss caused by offering suboptimal assortments—after T time periods is bounded by √T times a constant that does not depend on the number of products. This improves earlier regret rates that do depend on the number of products; see, for example, the bounds derived in Rusmevichientong et al. (2010, theorem 3.4), Sauré and Zeevi (2013, theorem 4), Agrawal et al. (2017, theorem 1), and Agrawal et al. (2019, theorems 1, 3, and 4), applied with a maximum capacity of K = N products. In addition, Chen et al. (2021) construct an instance where the marginal revenue is 1 for odd-numbered products and 1/2 for even-numbered products, and prove that, for this instance, the worst-case regret over a set of possible model parameters that any policy must endure is bounded from below by a constant times √T. This implies that the √T regret rate can, in general, not be improved.
Our work is also related to Peeters et al. (2021), who study a stochastic approximation policy in the context of dynamic assortment optimization under a continuous logit choice model where the set of feasible products is the unit interval. The stochastic approximation policy that we propose in this work can be seen as a discrete counterpart of the policy constructed and analyzed in Peeters et al. (2021, section 4), but with different regret rates (and different proof techniques) caused by the structural differences between continuous and discrete assortment optimization. This is visible in the regret rates that we derive: in the continuous setting studied by Peeters et al. (2021, section 4), a regret growth rate of log T is optimal, whereas in the discrete setting studied in this article √T is the best attainable rate.

Contributions and outline
In this article, we propose an easily implementable and asymptotically optimal data-driven policy for the uncapacitated assortment optimization problem under the MNL model with unknown parameters. Our policy is based on stochastic approximation and exploits structural properties of the optimal assortment, so that not all unknown model parameters have to be learned from data. The policy does not require the time horizon as input. Under a mild positivity assumption on the no-purchase probability, we prove that the regret of our policy is bounded from above by √T times a constant independent of the number of products. In addition, we prove a √T regret lower bound that any policy must endure for any given vector of product revenues. This slightly generalizes the lower bound proven by Chen et al. (2021), and implies that policies with O(√T) regret are asymptotically optimal for any product revenue parameters. We conduct numerical experiments that demonstrate that our policy has a robust performance in different instances, and can outperform alternative algorithms by a significant margin when T and the number of products are not too small.
We emphasize that our policy is not the first to have been shown to be asymptotically optimal (that is achieved by Chen et al. (2021)); the value of our policy lies in the fact that it is easy to understand and implement, and has superior numerical performance when N and T are not too small.
The rest of this article is organized as follows. We introduce the model in Section 2. In Section 3, we present our policy and the upper bound on its regret, as well as the lower bound result of the regret of any policy. The numerical experiments are contained in Section 4. Mathematical proofs are collected in the Appendix.

MODEL
We consider a seller who has N ∈ ℕ different products for sale during T ∈ ℕ time periods, and who has to determine at the beginning of each time period which subset of products is available for purchase. We abbreviate the set of products {1, … , N} as [N] and the set of time periods {1, … , T} as [T]. Each product i ∈ [N] yields a known marginal revenue of w_i > 0. We assume that w_i ⩽ 1 for all i ∈ [N], and that the products are ordered in strictly increasing order with respect to their marginal revenue: w_1 < · · · < w_N. (Note that, in the uncapacitated MNL model, products that yield the same marginal profit to the seller can be considered as the same product.) Each product i ∈ [N] is associated with a preference parameter v_i ⩾ 0, unknown to the seller. At the beginning of each time period t ∈ [T], the seller selects an assortment S_t ⊆ [N] based on the purchase information available up to and including time t − 1. Thereafter, the seller observes a purchase Y_t ∈ S_t ∪ {0}, where product 0 corresponds to a no-purchase. Clearly, w_0 = 0, and we set v_0—the preference parameter for product 0—equal to 1. The purchase probabilities under the MNL model are given by

P(Y_t = i | S_t = S) = v_i / (1 + Σ_{j∈S} v_j),  i ∈ S ∪ {0}.

The expected revenue earned by the seller from an assortment S ⊆ [N] is denoted by

r(S, v) ≔ Σ_{i∈S} w_i v_i / (1 + Σ_{j∈S} v_j).

The assortment decisions of the seller are described by his/her policy π: a collection of probability distributions π = (π(⋅ | h))_{h∈H}, where H is the set of possible histories, and where, conditionally on h = (S_1, Y_1, … , S_{t−1}, Y_{t−1}), assortment S_t has distribution π(⋅ | h), for all h ∈ H and all t ∈ [T]. Let P^π_v denote the probability measure of {S_t, Y_t : t ∈ ℕ} under policy π and preference vector v, and let E^π_v be the corresponding expectation operator.
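For concreteness, the MNL choice probabilities and the expected revenue r(S, v) can be computed as follows. This is a minimal sketch of the standard formulas, not code from the paper; the helper names `mnl_choice_probs` and `expected_revenue` are ours.

```python
def mnl_choice_probs(S, v):
    """MNL purchase probabilities for assortment S: product i in S is
    purchased with probability v[i] / (1 + sum of v[j] for j in S); the
    no-purchase option 0 (preference v_0 = 1) gets the remaining mass."""
    denom = 1.0 + sum(v[i] for i in S)
    probs = {i: v[i] / denom for i in S}
    probs[0] = 1.0 / denom  # option 0 = no-purchase
    return probs


def expected_revenue(S, v, w):
    """Expected revenue r(S, v) = sum_{i in S} w[i] v[i] / (1 + sum_{j in S} v[j])."""
    denom = 1.0 + sum(v[i] for i in S)
    return sum(w[i] * v[i] for i in S) / denom
```

For instance, with preferences v = {1: 1.0, 2: 2.0} and revenues w = {1: 0.5, 2: 0.8}, offering both products gives an expected revenue of 0.525 and a no-purchase probability of 0.25.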
The objective for the seller is to find a policy that maximizes the total accumulated revenue or, equivalently, minimizes the accumulated regret

Δ_π(T, v) ≔ Σ_{t=1}^{T} ( max_{S⊆[N]} r(S, v) − E^π_v[ r(S_t, v) ] ).

In addition, we consider the worst-case regret over a class V of preference vectors: Δ_π(T) ≔ sup_{v∈V} Δ_π(T, v). The class of preference vectors V under consideration consists of all vectors v that satisfy the following assumption.
Assumption 1 V is the set of all componentwise non-negative vectors v for which the no-purchase probability when the full assortment is offered is at least p_0, that is, 1/(1 + Σ_{i∈[N]} v_i) ⩾ p_0, for some p_0 ∈ (0, 1) known to the seller.
This assumption is arguably mild, as it simply puts a lower bound p_0 on the no-purchase probability that should be strictly positive but can otherwise be arbitrarily small. Note that Assumption 1 does not incorporate a "separability" condition (as in, e.g., Rusmevichientong et al., 2010, assumption 2.1) between the true optimal assortment and sub-optimal alternatives. Assumption 1 is used to ensure that the policy parameters can be chosen such that convergence is guaranteed. For the choice of p_0, see Remarks 1 and 2.

ASSORTMENT OPTIMIZATION
In this section, we discuss the structure of the optimal assortment, propose a policy for incomplete information, and show that its regret is bounded by √T times a constant independent of N. Then we show that, for arbitrary revenue parameters, the regret of any policy is bounded from below by a constant times √T, implying that our policy is asymptotically optimal in a general setting. The mathematical proofs are contained in the Appendix.
For uncapacitated assortment optimization under the MNL model, it is known that the optimal assortment is of the form "offer the k most profitable products" for some integer k ∈ [N]; see, for example, Talluri and van Ryzin (2004, proposition 6). To facilitate our analysis, we define, for a threshold θ ⩾ 0, the assortment S_θ ≔ {i ∈ [N] : w_i ⩾ θ} and the function h(θ) ≔ r(S_θ, v). This function is piecewise constant and attains every value in the set {r(S_{w_k}, v) : k ∈ [N]}. Moreover, the assortment S* = S_{θ*} with θ* ≔ max_{S⊆[N]} r(S, v) is optimal under preference vector v; see, for example, Chen et al. (2021, section 4). This structure has compelling computational implications, as we only need to approximate the one-dimensional quantity θ* instead of following the more straightforward approach of establishing estimates of r(S, v) for many assortments S.
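The computational implication can be illustrated with a short sketch (our own illustration; `optimal_threshold_assortment` is a hypothetical helper, not from the paper): instead of searching over all 2^N subsets, it suffices to compare the nested threshold assortments.

```python
def optimal_threshold_assortment(w, v):
    """Find the optimal uncapacitated MNL assortment by exploiting the
    'offer the k most profitable products' structure: only the N nested
    assortments {i : w_i >= w_k} plus the empty set need to be compared,
    instead of all 2^N subsets.

    w, v are lists indexed 0..N-1 with w sorted increasingly.
    Returns (optimal assortment as a set of indices, its expected revenue).
    """
    def revenue(S):
        denom = 1.0 + sum(v[i] for i in S)
        return sum(w[i] * v[i] for i in S) / denom

    N = len(w)
    candidates = [set(range(k, N)) for k in range(N)] + [set()]
    best = max(candidates, key=revenue)
    return best, revenue(best)
```

For w = [0.3, 0.6, 0.9] and v = [1, 1, 1], the optimal assortment is {1, 2} with expected revenue 0.5; note that 0.5 is exactly the threshold selecting {i : w_i ⩾ 0.5}, illustrating the threshold structure of the optimum.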

A policy for incomplete information
In this section, we propose a policy to iteratively establish a sequence of assortments that converges to the optimal assortment. The policy follows the intuitive approach of offering products with marginal revenue above a threshold value θ_t. Based on the observed (no-)purchases, each next threshold value θ_{t+1} is selected. The policy is parameterized by α ⩾ 1 and β ⩾ α − 1.

Stochastic approximation policy SAP(α, β)
1. Initialize t ≔ 1 and θ_1 ∈ [w_1, w_N].
2. Offer the assortment S_t = S_{θ_t} = {i ∈ [N] : w_i ⩾ θ_t} and observe the purchase Y_t. Set θ_{t+1} ≔ θ_t + a_t (w_{Y_t} − θ_t), where a_t ≔ α/(t + β). Put t ≔ t + 1. If t ⩽ T, then go to 2, else to 3.
3. Terminate.
The policy SAP(α, β) is a classical stochastic approximation algorithm (Robbins & Monro, 1951; Kushner & Yin, 1997) that relies on the observation that θ* is the unique solution to the fixed point equation θ = r(S_θ, v). This is easily verified (see, e.g., lemma 2 of Chen et al. (2021)). Note that we do not directly observe the value of r(S_θ, v). However, we do have the unbiased, noisy observation w_Y, given offered assortment S_θ, at our disposal. As a result, the sign of w_Y − θ approximately indicates the direction in which θ* is situated relative to θ. The step sizes a_t decay approximately as 1/t, ensuring the correct convergence rate of θ_t to θ*: θ_t does not "keep jumping over" θ*, and θ_t does not converge "too slowly."
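One sample path of SAP can be simulated as follows. This is our own sketch, not the authors' code; the update θ_{t+1} = θ_t + a_t(w_{Y_t} − θ_t) with a_t = α/(t + β) is our reading of the policy description above.

```python
import random


def simulate_sap(w, v, T, alpha=1.0, beta=0.0, theta0=None, seed=1):
    """Simulate one sample path of SAP(alpha, beta) on an MNL instance.

    Each period, the assortment {i : w[i] >= theta_t} is offered, a purchase
    Y_t is drawn from the MNL model (0 = no-purchase, w_0 = 0), and the
    threshold is updated via theta_{t+1} = theta_t + a_t (w_{Y_t} - theta_t)
    with step size a_t = alpha / (t + beta).  Returns the final threshold.
    """
    rng = random.Random(seed)
    N = len(w)
    theta = w[0] if theta0 is None else theta0
    for t in range(1, T + 1):
        S = [i for i in range(N) if w[i] >= theta]
        denom = 1.0 + sum(v[i] for i in S)
        # Sample Y_t: product i with probability v[i]/denom, else no-purchase.
        u, acc, wy = rng.random(), 0.0, 0.0
        for i in S:
            acc += v[i] / denom
            if u < acc:
                wy = w[i]
                break
        a_t = alpha / (t + beta)
        theta += a_t * (wy - theta)
    return theta
```

On the instance w = [0.3, 0.6, 0.9], v = [1, 1, 1], where the fixed point is θ* = 0.5, the iterate θ_t settles near 0.5 after a few thousand periods.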

Regret upper bound
We proceed by showing that the worst-case regret of SAP(α, β) is bounded from above by √T times a constant independent of N.
The proof relies on two main steps. First, we show that the regret is bounded from above by

Δ_π(T, v) ⩽ c_1 Σ_{t=1}^{T} E_v| θ_t − θ* |  (1)

for all v ∈ V and some instance-independent constant c_1. Second, we show a recursive relationship for the mean squared error of θ_t with respect to θ*. The convergence rate of θ_t to θ* is then a consequence of this recursive relation. This second step is summarized in the following lemma.

Lemma 1 Let v ∈ V and let α ⩾ 1/p_0 and β ⩾ α − 1. Then the mean squared error of θ_t with respect to θ* satisfies the recursive inequality (2). As a result, there exists an instance-independent constant c_2 such that, for all t = 1, … , T + 1,

E_v[(θ_t − θ*)²] ⩽ c_2/(t + β).

We then apply Jensen's inequality to the concave function x ↦ √x to conclude that the mean absolute error of θ_t with respect to θ* converges as 1/√t, that is, E_v|θ_t − θ*| ⩽ √(c_2/(t + β)). With this convergence rate of the mean absolute error in combination with (1), the proof of Theorem 1 follows.
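The Jensen step and the subsequent summation can be made explicit. Under the bound E_v[(θ_t − θ*)²] ⩽ c_2/(t + β) with β ⩾ 0 (our reconstruction of the omitted display):

```latex
\[
  \mathbb{E}_v\bigl[\lvert \theta_t - \theta^* \rvert\bigr]
  = \mathbb{E}_v\Bigl[\sqrt{(\theta_t - \theta^*)^2}\Bigr]
  \le \sqrt{\mathbb{E}_v\bigl[(\theta_t - \theta^*)^2\bigr]}
  \le \sqrt{\frac{c_2}{t + \beta}} \le \sqrt{\frac{c_2}{t}},
\]
so that summing over $t \in [T]$ gives
\[
  \sum_{t=1}^{T} \mathbb{E}_v\bigl[\lvert \theta_t - \theta^* \rvert\bigr]
  \le \sqrt{c_2}\sum_{t=1}^{T} \frac{1}{\sqrt{t}}
  \le \sqrt{c_2}\Bigl(1 + \int_1^T \frac{dt}{\sqrt{t}}\Bigr)
  \le 2\sqrt{c_2 T}.
\]
```

Combined with a regret bound that is linear in the mean absolute errors, this yields the √T rate.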
Remark 1 The requirement α ⩾ 1/p_0 in Theorem 1 is caused by the constant p_0 in (2), and ensures an O(√T) regret for all preference vectors v ∈ V, even those with a small corresponding no-purchase probability. However, as the performance of SAP depends on its parameters, large values of α and β when p_0 is small might not be necessary in practice. The inequality (2) is also valid when p_0 is replaced by the quantity λ_v defined in (4), where θ*_v denotes the optimal revenue as a function of the preference vector v. We still obtain an O(√T) regret bound as long as α ⩾ 1/λ_v, which is a weaker requirement on α, since λ_v might be substantially larger than p_0.
This example shows that there exist instances within V where λ_v is of the order p_0, as well as instances where λ_v ≫ p_0. Hence, the requirement that α is of the order 1/p_0, although necessary for bounding the worst-case regret as done in Theorem 1, can for some preference vectors be mitigated while maintaining good case-specific performance in practice. For a numerical illustration we refer to Section 4.
Remark 2 As long as the assumed value of p_0 is lower than λ_v defined in (4), with v the true preference vector, the policy performs as intended and our convergence results are valid. The speed at which θ_t converges to θ* is negatively affected, however, if p_0 is substantially smaller than λ_v. On the other hand, if the assumed value of p_0 is higher than λ_v, then there is no guarantee of convergence. More specifically, the policy could offer sub-optimal assortments "too often," as θ_t could get "stuck" when θ_t ≪ θ*. This corresponds to the parameter α being set too low. In a sense, the choice of p_0 captures the trade-off between generality and performance: decreasing p_0 enlarges the set of preference vectors for which convergence is guaranteed, but simultaneously decreases the convergence speed (although not the rate as a function of time). In practice, p_0 can be selected in a data-driven fashion, before the start of the algorithm, by offering the full assortment a number of times n, constructing a confidence interval around the observed fraction of no-purchases p̂ (for example, with b_n = √(2 log(1/δ)/n) for some δ ∈ (0, 1)), and setting p_0 equal to the lower end of the confidence interval (unless p̂ − b_n ⩽ 0, in which case the experiment needs to be prolonged). Then p_0 will, with high probability, be a lower bound on the true but unknown no-purchase probability.
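The data-driven selection of p_0 just described can be written out as follows. This is a minimal sketch under our own assumptions: the pilot phase is simulated with a known preference vector purely for illustration, and `estimate_p0` is a hypothetical helper name.

```python
import math
import random


def estimate_p0(v, n, delta=0.05, seed=0):
    """Offer the full assortment n times, record the fraction of
    no-purchases p_hat, and return the lower confidence bound
    p_hat - b_n with b_n = sqrt(2 log(1/delta) / n), as in Remark 2.
    Returns None when p_hat - b_n <= 0, signalling that the pilot
    experiment should be prolonged.
    """
    rng = random.Random(seed)
    # True no-purchase probability for the full assortment (simulation only).
    p_true = 1.0 / (1.0 + sum(v))
    no_purchases = sum(rng.random() < p_true for _ in range(n))
    p_hat = no_purchases / n
    b_n = math.sqrt(2.0 * math.log(1.0 / delta) / n)
    return p_hat - b_n if p_hat - b_n > 0 else None
```

With v = [1, 1, 1] (true no-purchase probability 0.25) and n = 10000, the returned p_0 lands a little below 0.25, a conservative lower bound as intended.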

Regret lower bound
Now that we have provided an upper bound on the regret of our SAP policy, we proceed by showing that this bound is asymptotically tight—up to a multiplicative constant—as T grows large. This implies that our policy is asymptotically optimal. We prove our regret lower bound for values of p_0 in Assumption 1 that satisfy the condition in (5). Observe that this condition can always be ensured to hold by choosing p_0 sufficiently small. The regret lower bound is presented below.
Theorem 2 There exists a C > 0 such that, for all policies π and all T ∈ ℕ, Δ_π(T) ⩾ C√T.
The proof of Theorem 2 is established in three steps. First, we construct two preference vectors v_0 and v_1 that are statistically "difficult to distinguish." Second, we show that any estimator φ that has the observed purchases Y_1, … , Y_T as inputs and outputs either 0 or 1 must satisfy the lower bound (6) on max_{j=0,1} P_j(φ ≠ j). Third, we define a specific estimator φ and show that, under the assumption that Δ_π(T) < C√T, the misclassification probability P_j(φ ≠ j) is small for both j = 0, 1. Having found a contradiction with (6), we thus conclude that the statement in Theorem 2 must hold. The novelty of this proof is concentrated in the first step, which allows us to prove Theorem 2 for arbitrary revenue parameters. The second and third steps are conceptually the same as in Chen et al. (2021). We include the last two steps because of the slight deviation in our set-up, as well as for the sake of completeness.
Starting with the first step, we define two quantities. Let k, ℓ ∈ [N] with k < ℓ be such that the condition in Equation (5) is satisfied; note that such k, ℓ exist by Equation (5). Furthermore, define u_k and u_ℓ through the identities u_k w_k = u_ℓ w_ℓ and u_k = u_ℓ + 1. The idea underlying this set-up is as follows. Let u denote the preference vector that has u_k as its kth component, u_ℓ as its ℓth component, and is zero everywhere else. Then it holds that r({k, ℓ}, u) = r({ℓ}, u). By perturbing u in two ways, we can construct two preference vectors v_0 and v_1 that are very close to u but result in different solutions to the assortment optimization problem. Specifically, we define v_0 and v_1 as follows. Let ε ∈ (0, 1/2] denote a small constant; v_0 is obtained from u by multiplying the ℓth component by (1 − ε), and v_1 by multiplying it by (1 + ε). From this set-up it follows that {k, ℓ} is the optimal assortment under v_0 and {ℓ} is the optimal assortment under v_1 (up to inclusion of products with zero preference). In what follows, we abbreviate the expectation and probability E^π_{v_j}[⋅] and P^π_{v_j}(⋅) as E_j[⋅] and P_j(⋅) for j = 0, 1. We suppress the dependence of these two notions on the policy π, as Theorem 2 holds for any policy. Now, for the second step, we use the following lemma, in which we first bound the Kullback-Leibler (KL) divergence.
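The two identities u_k w_k = u_ℓ w_ℓ and u_k = u_ℓ + 1, which reappear in the proofs of Lemmas 2 and 3, determine u_k and u_ℓ uniquely. A quick derivation (our reconstruction, not a display from the paper):

```latex
\[
  u_k w_k = u_\ell w_\ell, \qquad u_k = u_\ell + 1
  \;\Longrightarrow\;
  (u_\ell + 1)\, w_k = u_\ell w_\ell
  \;\Longrightarrow\;
  u_\ell = \frac{w_k}{w_\ell - w_k}, \qquad
  u_k = \frac{w_\ell}{w_\ell - w_k},
\]
```

which is well defined since w_k < w_ℓ for k < ℓ. In particular, under u the two candidate assortments {k, ℓ} and {ℓ} yield the same expected revenue, since 1 + u_k + u_ℓ = 2(1 + u_ℓ) and u_k w_k + u_ℓ w_ℓ = 2 u_ℓ w_ℓ.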

Lemma 2 Let S ⊆ [N]. Then there is a constant c_3 > 0 such that the KL divergence between P_0(⋅ | S) and P_1(⋅ | S) is at most c_3 ε². In addition, let φ ∈ {0, 1} denote an arbitrary estimator that has the random purchases Y_1, … , Y_T as inputs. Then max_{j=0,1} P_j(φ ≠ j) is bounded from below as in (6).
The lemma above contains statements similar to lemmas EC.1 and 12 of Chen et al. (2021). However, we consider different preference vectors than Chen et al. (2021), because we allow arbitrary revenue parameters. The proof of the lemma makes use of Pinsker's inequality as well as Le Cam's method. From Lemma 2, and by setting ε equal to the minimum specified in the proof, it follows that (6) holds. As the first part of the third step, we provide a lower bound on the regret under v_0 and v_1 in terms of T, ε, ℘_0, and ℘_1, where ℘_0 denotes the number of periods t ∈ [T] in which k and ℓ are both contained in S_t, and ℘_1 the number of periods in which k is excluded while ℓ is contained in S_t. This lower bound is stated in the lemma below.
Lemma 3 There is a constant c_4 > 0 such that, for any policy π, the lower bounds (8) and (9) hold.
The lemma above covers statements similar to those shown by Chen et al. (2021) in the proof of their lemma 13. Here as well, however, we consider different preference vectors than Chen et al. (2021), because we allow arbitrary revenue parameters.
The remainder of the third step is established by contradiction. To this end, assume that (10) holds, with C as defined there. In addition, in view of (8) and (9), we define the quantities L_0 and L_1. As a consequence of the assumption in (10), we conclude by Markov's inequality and Lemma 3 that ℘_0 and ℘_1 are suitably bounded with high probability. Next, we define the estimator φ based on the observed purchases. The remainder of the proof is to show that φ = 1 implies L_0 > 4C√T and that φ = 0 implies L_1 > 4C√T. From this we obtain a contradiction with the statement in (6), which holds by Lemma 2. Therefore, statement (10) cannot be true, and we have thus proven Theorem 2.

NUMERICAL EXPERIMENTS
In this section, we compare the performance of our proposed policy with alternatives for dynamic assortment optimization under the MNL model. In particular, we compare our policy SAP with the Thompson sampling (TS) based policy from Agrawal et al. (2017) and the Trisection (TR) policy from Chen et al. (2021) using simulated data. It is worth observing that an O(√T) regret bound has been proven only for SAP and TR. We simulate purchase data in two scenarios.

Scenario 1 For different values of N and T, we draw N values uniformly at random from [0.4, 0.5], order the randomly drawn values in increasing order, and set the ordered values as the revenue parameters w_1, … , w_N. We draw the preference parameters v_1, … , v_N uniformly at random from [10/N, 20/N].
Scenario 2 As in Scenario 1, but now we draw the unsorted revenue parameters uniformly at random from [0, 1], order the randomly drawn values in increasing order, and set the ordered values as the revenue parameters w 1 , … , w N ; in addition, we draw the preference parameters v 1 , … , v N uniformly at random from [1∕N, 100∕N].
The experimental setup described by Scenario 1 follows the set-up from Chen et al. (2021). This setup ensures that finding the optimal solution is non-trivial as the optimal assortment is equal to {i ∶ w i ⩾ x} for some x that lies approximately within [0.4, 0.5]. We include the simulated results from Scenario 2 to investigate how the policies perform for a broader range of parameters: the range of possible no-purchase probabilities when offering the entire set of products is [1∕101,1∕2] for Scenario 2, whereas for Scenario 1 the range is [1∕21, 1∕11].
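The two scenarios can be generated as follows (our own sketch; `draw_instance` is a hypothetical name, not code from the paper).

```python
import random


def draw_instance(N, scenario, seed=0):
    """Draw one random instance of the two simulation scenarios.

    Scenario 1: revenues uniform on [0.4, 0.5], preferences uniform on
    [10/N, 20/N].  Scenario 2: revenues uniform on [0, 1], preferences
    uniform on [1/N, 100/N].  Revenues are returned sorted increasingly,
    matching the ordering convention w_1 < ... < w_N of the model.
    """
    rng = random.Random(seed)
    if scenario == 1:
        w = sorted(rng.uniform(0.4, 0.5) for _ in range(N))
        v = [rng.uniform(10 / N, 20 / N) for _ in range(N)]
    else:
        w = sorted(rng.uniform(0.0, 1.0) for _ in range(N))
        v = [rng.uniform(1 / N, 100 / N) for _ in range(N)]
    return w, v
```

Note that in Scenario 1 the preference parameters sum to a value in [10, 20], so the no-purchase probability for the full assortment lies in [1/21, 1/11], matching the range stated above.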
In both simulation scenarios we set N to values ranging from 25 to 5000, roughly doubling each step, and we simulate the regret of the policies a total of 1000 times. For each N, we draw the preference and revenue parameters and, for the Trisection policy, let the policy run for T = 500, 1000, 2500, 5000. We let our policy SAP and the Thompson Sampling policy run for T = 5000 and record the intermediate regret at t = 500, 1000, 2500.
For the Trisection policy, we use the version with adaptive confidence levels; see Chen et al. (2021, section 6). In addition, in line with the simulations of Chen et al. (2021), notice that the preference parameters satisfy Assumption 1 for p_0 = 1/21 and p_0 = 1/101 in Scenarios 1 and 2, respectively. Theorem 1 indicates that we should set the parameters of SAP as α = 21 and β = 20 in Scenario 1, and α = 101 and β = 100 in Scenario 2. Following Remark 1, we can obtain better performance with smaller parameters. Hence, we also include the performance of SAP with α = 1 and β = 0; this policy is referred to as Adjusted SAP (ASAP). In addition, in all instances we start SAP with θ_1 = w_1.
TABLE 1 Average (mean) of the simulated regret in Scenarios 1 and 2 for the policies TS, TR, SAP (with α = 1/p_0 and β = α − 1) and ASAP (with α = 1 and β = 0), based on 1000 runs.
Table 1 reports the average regret of the policies in Scenarios 1 and 2 over 1000 simulation runs. All standard errors are within 3%. From Table 1 we see that, in both scenarios, the regret of SAP and ASAP does not appear to grow large when N increases, which is in line with our theoretical findings. In addition, we find that, in Scenario 1, SAP performs on par with TR for T = 500, 1000, and substantially outperforms TR for larger values of T. In Scenario 2, SAP outperforms TR significantly for all values of N and T. We also find that the adjusted policy ASAP improves performance even further, by several orders of magnitude. When we compare SAP to TS, we see that TS performs well for small values of N, whereas SAP is better for large values of N. The fact that the regret of TS appears to grow substantially in N suggests that TS is not an asymptotically optimal policy (i.e., one with regret bounded by a factor that is independent of N).

Results
Overall, we find that SAP and ASAP perform well and robustly for increasing values of N. In addition, we see that adequate parameter adjustment for SAP can yield a significant benefit over established alternatives, especially for large values of N. This suggests that designing an algorithm in which α and β are adaptively tuned may be an interesting direction for future research.
For illustrative purposes, we include graphs in Figure 1 of the average regret of the policies for N fixed at 5000.

CONFLICTS OF INTEREST
The authors have no conflicts of interest to disclose.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Proof of Lemma 1 First, observe that the decomposition (A1) holds.
For bounding the cross term in (A1), first suppose that θ_t < θ*, and recall the function h(θ) ≔ r(S_θ, v). Note that θ* = h(θ*), since θ* = r(S_{θ*}, v), and note that S_{θ*} ⊆ S_{θ_t}. Therefore, the cross term can be bounded in this case. Next, consider the case θ_t ⩾ θ*. Then it holds that h(θ_t) ⩽ h(θ*) = θ* and, since p_0 ⩽ 1, an analogous bound follows. Either way, the cross term in (A1) is bounded from above as required. As a result, we conclude that the squared error satisfies the stated inequality; taking the expectation on both sides yields (2). For t = 1, … , T + 1, denote the expected squared difference of θ* and θ_t as e_t ≔ E_v[(θ_t − θ*)²].
By induction, we show that inequality (2) implies e_t ⩽ c_2/(t + β), for all t = 1, … , T + 1 and some c_2 > 0. To this end, let c_2 be a sufficiently large constant. For t = 1, we observe that the claim holds by the choice of c_2. Next, assume that e_t ⩽ c_2/(t + β) for all t ⩽ t_0 for some t_0. Then, for t > t_0, it follows, since α ⩾ 1/p_0 and by our choice of c_2, that the recursion (2), combined with the induction hypothesis, yields e_{t+1} ⩽ c_2/(t + 1 + β), where the last inequality holds because 2(t + β)² ⩾ t + β + 1, as t + β ⩾ 1, and by the choice of c_2, so that we have shown the lemma. ▪

Proof of Theorem 1 Recall that h(θ) = r(S_θ, v) and that θ* = h(θ*); see, for example, Chen et al. (2021, section 4). Next, note that (A2) holds. Also note that, for t ∈ [T], w_{Y_t} can be written in terms of a_t, θ_t, and θ_{t+1} as w_{Y_t} = θ_t + (θ_{t+1} − θ_t)/a_t. As a consequence of (A2) and (A3), the regret can be bounded in terms of the deviations |θ_t − θ*|. Given that |θ* − θ_1| ⩽ 1, it follows that the remaining terms are bounded by a constant.

From Lemma 1 and Jensen's inequality applied to the concave function x ↦ √x, we obtain E_v|θ_t − θ*| ⩽ √(c_2/(t + β)). From this, we conclude that the regret is bounded from above as stated in Theorem 1. ▪

PROOFS OF THE STATEMENTS IN SECTION 3.3
In this section, we abbreviate the expectation and probability E^π_{v_j}[⋅] and P^π_{v_j}(⋅) as E_j[⋅] and P_j(⋅) for j = 0, 1, with v_0 and v_1 as in (7).

Proof of Lemma 2
First, note that if ℓ ∉ S, then P_0(⋅ | S) = P_1(⋅ | S). Hence, we may assume that ℓ ∈ S. In addition, note that P_j(⋅ | S) = P_j(⋅ | S ∩ {k, ℓ}) for j = 0, 1, since all products outside {k, ℓ} have zero preference under v_0 and v_1. Therefore, it suffices to check S = {k, ℓ} and S = {ℓ}. To this end, define p_i ≔ P_0(Y = i | S) and q_i ≔ P_1(Y = i | S), where Y denotes a random purchase from S. First, consider the case S = {k, ℓ}. Then, for i = 0, k, it holds that p_i ⩾ q_i, since v_{1,ℓ} > u_ℓ, ε < 1, u_ℓ < u_k, and 1 < u_k. Combining the resulting bounds, the KL divergence between P_0(⋅ | S) and P_1(⋅ | S) is at most c_3 ε², where the first inequality in this derivation is easily verified (see, e.g., lemma 3 of Chen and Wang (2018)). Now, consider the case S = {ℓ}. Similarly, we derive, for i = 0, ℓ, analogous bounds on p_i and q_i, since v_{1,ℓ} > u_ℓ, ε < 1, u_ℓ < u_k, and 1 < u_k. Next, note that (1 + (1 − ε)u_ℓ)(1 + (1 + ε)u_ℓ) ⩾ 2u_ℓ, and, as a result, the KL divergence is again at most c_3 ε², where the first inequality is again easily verified (see, e.g., lemma 3 of Chen and Wang (2018)). The final statement of the lemma follows by applying Pinsker's inequality and Le Cam's method as follows. First, consider the entire probability measures P_0 and P_1. By the chain rule for the KL divergence, the divergence between P_0 and P_1 is bounded by the sum of the per-period divergences. From Pinsker's inequality, it then follows that the total variation (TV) distance between P_0 and P_1 is bounded from above accordingly. Next, Le Cam's method entails considering the event B ≔ {φ = 0}. It then follows that max_{j=0,1} P_j(φ ≠ j) ⩾ (1/2)(P_0(φ = 1) + P_1(φ = 0)), which concludes our proof. ▪
Next, we bound the regret under v_1 from below. Again by (B1), we only need to consider S_t = {k}, {k, ℓ}, ∅. Then, as u_k w_k = u_ℓ w_ℓ, u_k = u_ℓ + 1, and ε ∈ (0, 1/2], analogous per-period lower bounds follow, and by taking the expectation with respect to v_1 and summing over t ∈ [T] we have shown (9) with c_4 = u_ℓ w_ℓ/(2 + 3u_ℓ)² as well. ▪
Proof of Theorem 2 As discussed in Section 3.3, we have established the first two steps: constructing two preference vectors v_0 and v_1, and showing that, as a consequence of Lemma 2, any estimator φ that has as input the observed purchases Y_1, … , Y_T and outputs either 0 or 1 satisfies the lower bound (B2) on max_{j=0,1} P_j(φ ≠ j). It remains to show that v_0, v_1 ∈ V and to finish the third step by establishing a contradiction for a specific estimator φ. First, note that for both j = 0, 1 the components of v_j sum to at most u_k + (3/2)u_ℓ, as u_ℓ = u_k − 1 and ε ⩽ 1/2. Consequently, v_0 and v_1 lie in the class V by the condition in (5). We continue our proof by establishing a contradiction. To this end, recall the definition of C and the definitions of ℘_0, ℘_1, L_0, and L_1. Now, we assume that (B3) holds. As a consequence of the assumption above, we conclude by Markov's inequality and Lemma 3 that ℘_0 and ℘_1 are suitably bounded with high probability. Next, define the estimator φ as in Section 3.3. From φ = 1 it follows, as ℘_0 + ℘_1 ⩽ T and ε < 1, that L_0 is bounded from below; note that the same bound holds when ε = 1/2. From this we conclude that φ = 1 implies L_0 > 4C√T, and therefore P_0(φ = 1) is small, yielding (B4). Now consider φ = 0. This implies, as ℘_0 + ℘_1 ⩽ T, a corresponding lower bound on L_1, where the last inequality is established as before. From this we conclude that φ = 0 implies L_1 > 4C√T, and therefore P_1(φ = 0) is small, yielding (B5). We conclude that (B4) and (B5) contradict (B2). Hence, the assumption in (B3) cannot be true. ▪