A fundamental problem of hypothesis testing with finite inventory in e-commerce

In this paper, we draw attention to a problem that is often overlooked or ignored by companies practicing hypothesis testing (A/B testing) in online environments. We show that conducting experiments on limited inventory that is shared between variants in the experiment can lead to high false-positive rates, since the core assumption of independence between the groups is violated. We provide a detailed analysis of the problem in a simplified setting whose parameters are informed by realistic scenarios. The setting we consider is a $2$-dimensional random walk in a semi-infinite strip. It is rich enough to take a finite inventory into account, but at the same time simple enough to allow for a closed form of the false-positive probability. We prove that high false-positive rates can occur, and develop tools that are suitable to help design adequate tests in follow-up work. Our results also show that high false-negative rates may occur. The proofs rely on a functional limit theorem for the $2$-dimensional random walk in a semi-infinite strip.


Introduction
The gold standard for testing product changes on e-commerce websites is large-scale hypothesis testing, also known as A/B testing.
When a given version of a website is modified, it is natural to ask whether or not the modified (new) version of the website performs better than the old one. It is very common to use the following approach based on classic hypothesis testing: during a fixed time period, the so-called testing phase, customers visiting the website are shown one of the two versions, and the choice of which one they see is made at random. For each version of the website, the owner thus collects a sample containing, for each customer who saw that version, relevant data such as whether or not they bought a good, how much money they spent, etc. Then a statistical test (A/B test) is applied to evaluate which version of the website performed better.
Typically, these tests rely on the assumption of independent samples. In the present paper, we point out in a quantitative way that in the situation where there is a finite amount of a popular good, the independence assumption does not hold and can often lead to wrong conclusions. The inventory is shared between variants, and if a copy of an item is sold, it cannot be bought by users who enter the experiment later. This implies that users are not independent, both within and between the variants. These dependencies could be avoided by randomly splitting on an item level instead of a user level, but this would reduce the choice of the customer and is therefore not a realistic setup.
We think it is best to illustrate the dependence problem with a ranking example that we will use throughout the paper. We limit ourselves to only two distinct products (each of which should be thought of as a variety of different products grouped into one). Using realistic parameters, we show that two different ranking algorithms which perform identically if run independently show a significant difference almost 20% of the time when run in an industry-standard A/B experiment. We also show that if there is a difference in performance between the algorithms, there are scenarios where the power of a standard A/B test is close to 0.

Example 1.1 (Ranking experiment, take 1). We consider a ranking experiment with two types of goods, one rare good, which is very attractive (good 2), and a second, less attractive good (good 1) available in practically unlimited quantities. In applications, there may be more than two types of goods, but the less attractive ones are labeled as type-1 goods, while the most attractive ones are labeled as type-2 goods. Suppose that in total there are 1 000 goods of type 2 and 1 000 000 goods of type 1.
A website displays the available goods to each visitor. The goods are displayed in a certain order, which depends on the ranking algorithm used. The owner of the website wants to compare two different algorithms. The default algorithm, Algorithm 0, displays the goods such that the type-2 goods have a low ranking and appear late in the list. Thus, only a fraction of the visitors gets to see them. The new algorithm, Algorithm 1, displays the goods such that the type-2 goods have the highest ranking and appear first in the list. Every visitor seeing the goods ranked by Algorithm 1 will see both, type-1 and type-2 goods (as long as they are available). The goal is to find out which of the two algorithms leads to a higher overall conversion rate, i.e., to a higher empirical probability to make a sale.
Suppose that during a test phase, n = 4 000 000 customers visit the website. Whenever a customer visits the website, a fair coin is tossed. If the coin shows heads, the products are displayed ranked according to Algorithm 1, if the coin shows tails, the products are displayed ranked according to Algorithm 0.
We now make the following model assumptions. We assume that, independently of all other customers, each customer has a 20% chance of preferring good 1 over good 2 and an 80% chance of preferring good 2 over good 1. When the goods are ranked according to Algorithm 0, the customer first sees type-1 goods. There is a 5% chance that the customer scrolls down and spots a type-2 good (if still available). A customer who sees both goods and has a preference for good 2 will buy good 2 with 10% chance and will not buy at all with 90% chance. A customer who either sees both goods and has a preference for good 1, or sees only good 1, will buy good 1 with a 5% chance and will not buy at all with 95% chance. For simplicity, we assume that each customer buys at most one good.
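These decision trees can be evaluated mechanically. The following sketch (Python with exact rational arithmetic; the helper name `conversion_probs` is ours, not the paper's) enumerates the branches and recovers the per-visitor purchase probabilities:

```python
from fractions import Fraction as F

# Enumerating the decision trees of Example 1.1 with exact rational
# arithmetic. The helper name `conversion_probs` is ours, not the paper's.

P_PREFER_2 = F(8, 10)    # customer prefers good 2 over good 1
P_SCROLL   = F(5, 100)   # under Algorithm 0: chance to scroll and spot good 2
P_BUY_2    = F(1, 10)    # buys good 2 after seeing and preferring it
P_BUY_1    = F(5, 100)   # buys good 1 in all other cases

def conversion_probs(p_sees_good_2):
    """Return (P(buy good 1), P(buy good 2)) while good 2 is still in stock."""
    sees_and_prefers_2 = p_sees_good_2 * P_PREFER_2
    return (1 - sees_and_prefers_2) * P_BUY_1, sees_and_prefers_2 * P_BUY_2

b1_0, b2_0 = conversion_probs(P_SCROLL)  # Algorithm 0: rare good ranked low
b1_1, b2_1 = conversion_probs(F(1))      # Algorithm 1: rare good ranked first

p0 = b1_0 + b2_0       # conversion rate of Algorithm 0 while good 2 is in stock
p1 = b1_1 + b2_1       # conversion rate of Algorithm 1 while good 2 is in stock
p_theta = P_BUY_1      # conversion rate once good 2 is sold out
```

While the rare good is in stock, the probability of a sale is thus $52/1000$ under Algorithm 0 and $9/100$ under Algorithm 1; once it is sold out, both drop to $5/100$.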
The data collected is a sample $(x_1, y_1, i_1), \ldots, (x_n, y_n, i_n)$, where $n$ is the sample size, i.e., the number of customers visiting the website during a certain test period. Here, $i_k$ is either $0$ or $1$, depending on whether Algorithm 0 or 1 was used to display the goods to the $k$th customer. Further, $x_k = 1$ or $x_k = 0$ depending on whether good 1 was bought or not and, analogously, $y_k = 1$ or $y_k = 0$ depending on whether good 2 was bought or not. Notice that by our assumption that each customer buys at most one good, we have $x_k + y_k \in \{0, 1\}$. Those $(x_k, y_k, i_k)$ with $i_k = 0$ are assigned to sample 0 and the others to sample 1. We write $n_0 := \sum_{k=1}^n (1 - i_k)$ and $n_1 := \sum_{k=1}^n i_k$ for the corresponding sample sizes. The numbers of sales in each group are $\ell_0 := \sum_{k=1}^n (x_k + y_k)(1 - i_k)$ and $\ell_1 := \sum_{k=1}^n (x_k + y_k) i_k$; the total number of sales is $\ell := \ell_0 + \ell_1$. The empirical probabilities of a sale are then $\ell_0/n_0$ and $\ell_1/n_1$.

The website owner wants to find out whether Algorithm 1 performs better than Algorithm 0. It is a common approach to test for the higher probability of a sale by assuming an independent sample and using a G-test or the asymptotically equivalent two-sample chi-squared test. The hypothesis is that the conversion rates are identical in both samples. For simplicity, in the paper at hand, we shall always consider the chi-squared test.

[Figure caption (a): Simulation of sales for Algorithms 0 and 1 run on separate inventories. Both algorithms sell all 1 000 attractive goods. Algorithm 0 additionally sells 199 528 goods of type 1, Algorithm 1 sells 199 325. The differences between the two algorithms are within the fluctuations one expects. Surely, this simulation does not give rise to the conclusion that Algorithm 1 outperforms Algorithm 0. However, Algorithm 1 sells the attractive goods earlier.]
The test statistic for the latter is
$$\chi^2 = \frac{n (\ell_0 n_1 - \ell_1 n_0)^2}{n_0 n_1 \ell (n - \ell)}.$$
The hypothesis is rejected if $\chi^2 > q_{1-\alpha}$, where $\alpha \in (0,1)$ is the significance level and $q_{1-\alpha}$ is the $(1-\alpha)$-quantile of the chi-squared distribution with one degree of freedom, see [3, Chapter 17].
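To see the shared-inventory effect concretely, one can simulate Example 1.1 end to end and apply the chi-squared test. The sketch below is ours, not the paper's simulation: it is scaled down to $n = 20\,000$ visitors and $c_n = 70 \approx \frac{1}{2}\sqrt{n}$ copies of the rare good so that it runs quickly, and all function names are our own.

```python
import random

def chi_squared(n0, n1, l0, l1):
    """Two-sample chi-squared statistic for the 2x2 table of sales vs. variant."""
    n, l = n0 + n1, l0 + l1
    if min(n0, n1) == 0 or l == 0 or l == n:
        return 0.0
    return n * (l0 * n1 - l1 * n0) ** 2 / (n0 * n1 * l * (n - l))

def run_experiment(n=20_000, c=70, seed=0):
    """One shared-inventory A/B test as in Example 1.1, scaled down for speed."""
    rng = random.Random(seed)
    stock = c                        # remaining copies of the rare good 2
    visitors, sales = [0, 0], [0, 0]
    for _ in range(n):
        i = rng.random() < 0.5       # fair coin: True = Algorithm 1
        visitors[i] += 1
        sees2 = stock > 0 and (i or rng.random() < 0.05)
        if sees2 and rng.random() < 0.8:      # customer prefers the rare good
            if rng.random() < 0.1:            # ... and actually buys it
                stock -= 1
                sales[i] += 1
        elif rng.random() < 0.05:             # otherwise buys the abundant good
            sales[i] += 1
    return chi_squared(visitors[0], visitors[1], sales[0], sales[1])

Q_95 = 3.841  # 0.95-quantile of the chi-squared distribution, 1 degree of freedom
rejections = sum(run_experiment(seed=s) > Q_95 for s in range(50))
```

Under truly independent samples one would expect roughly 5% of the runs to reject; with the shared inventory, the rejection frequency is noticeably higher, in line with the discussion of Example 1.1.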
Throughout the paper, we shall return repeatedly to Example 1.1 and discuss it in the light of our findings.
We shall discuss a variant of this example later on, showing that ignoring the dependencies might also lead to high false-negative rates; see Example 3.8 below.

Model assumptions
We return to the general situation, in which a website offers two types of goods, good 1 and good 2. During a test phase, in which a new website design is used in parallel, the website has $n$ visitors. Suppose that the website has a practically unlimited supply of items of good 1, while there are only $c_n \in \{1, 2, 3, \ldots\} =: \mathbb{N}$ items of good 2. Typically, $n$ will be very large and $c_n$ will also be large, but significantly smaller than $n$. Whenever a user visits the website, a coin with success probability $p$ is tossed. If the coin shows heads, the new website design is displayed, whereas if the coin shows tails, the old design is displayed. We thus observe a sample $((x_1, y_1, i_1), \ldots, (x_n, y_n, i_n)) \in (\mathbb{N}_0 \times \mathbb{N}_0 \times \{0, 1\})^n$. Here, $x_k$ and $y_k$ are the numbers of goods of type 1 and 2, respectively, that have been bought by the $k$th visitor of the website during the test phase, while $i_k = 1$ if the new design has been displayed to the $k$th visitor, and $i_k = 0$ otherwise. We consider $((x_1, y_1, i_1), \ldots, (x_n, y_n, i_n))$ as the realization of a random vector $((X_1, Y_1, I_1), \ldots, (X_n, Y_n, I_n))$. We define $Z_k := \mathbb{1}_{\{X_k + Y_k > 0\}}$ to be the indicator of the event that the $k$th customer bought something. Further, we set $r_k := (s_k, t_k) := (x_1, y_1) + \ldots + (x_k, y_k)$ and $R_k := (S_k, T_k) := (X_1, Y_1) + \ldots + (X_k, Y_k)$ for $k = 0, \ldots, n$, where the empty sum is defined to be the zero vector.
2.1. The classical model assuming independence. Many website owners in e-commerce use the G-test or the chi-squared test in the given situation. This test only uses the information whether or not a good was purchased, that is, the only information from the sample ((X 1 , Y 1 , I 1 ), . . . , (X n , Y n , I n )) used by the test is (Z 1 , I 1 ), . . . , (Z n , I n ). This amounts to the following model assumptions.
(χ1) There is a sequence $(I_1, I_2, \ldots)$ of i.i.d. copies of a Bernoulli variable $I$ with $\mathbb{P}(I = 1) = p = 1 - \mathbb{P}(I = 0) \in (0, 1)$.
(χ2) There are $p_0, p_1 \in (0, 1)$ such that for all $k \in \mathbb{N}$ and $i = 0, 1$, the conditional law of $Z_k$ given $I_k = i$ is $p_i \delta_1 + (1 - p_i) \delta_0$.
(χ3) The family $((I_k, Z_k))_{k \in \mathbb{N}}$ is independent.
Here, and throughout the paper, for $x \in \mathbb{R}^d$, we write $\delta_x$ for the Dirac measure with unit mass at $x$. Assumptions (χ1) through (χ3) possess the following interpretations. (χ1): The random variable $I_k$ models the coin toss that is used to decide whether the new or the old website design is displayed to the $k$th visitor of the website during the test phase. (χ2): The random variable $Z_k$ is the indicator of the event that the $k$th visitor bought something. (χ3): The independence assumption in the context of the low-inventory problem is made for simplicity. We question the feasibility of this assumption in the present paper.

A model incorporating low inventory of a popular good.
We propose a simple model in which we keep track of the inventory of a rare good. Throughout the paper, we shall refer to this model as the 'model incorporating low inventory'. By $c_n \in \mathbb{N}$ we denote the quantity in which the rare good is available. The most important case we consider is where $c_n$ is asymptotically equivalent to a constant times $\sqrt{n}$. However, as we need this assumption only occasionally, throughout the paper we only assume that $(c_n)_{n \in \mathbb{N}}$ is a non-decreasing unbounded sequence of integers which is regularly varying at infinity with index $\rho \in (0, 1]$, that is,
$$\lim_{n \to \infty} \frac{c_{\lfloor nt \rfloor}}{c_n} = t^{\rho} \quad \text{for all } t > 0, \qquad (2.1)$$
and further $c_n = O(n)$ as $n \to \infty$, which is relevant only if $\rho = 1$. Notice that the case $c_n \sim \mathrm{const} \cdot \sqrt{n}$ is covered; indeed, in this case we have $\rho = \frac{1}{2}$. In the next step, we informally describe the evolution of the process $(R_k)_{k \in \mathbb{N}_0}$. Let $C_n := \mathbb{N}_0 \times \{0, 1, \ldots, c_n\}$ denote the semi-infinite strip in which the walk lives. At each step, a coin with success probability $p$ is tossed. Depending on whether the coin shows heads or tails, the walk attempts to make one step according to a probability distribution $\mu_1$ or $\mu_0$, respectively, on $\mathbb{N}_0^2$. The step is actually performed if the walk stays in the strip $C_n$. Otherwise, another independent coin with success probability $q$ is tossed. If the second coin shows heads, the walk moves in each coordinate direction according to the attempted step as far as possible but stops at the boundary of $C_n$. If the second coin shows tails, the walk stays put. Once the walk is on the boundary of $C_n$, it moves there according to a one-dimensional random walk in the horizontal direction.
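The informal dynamics just described can be written down directly. The sketch below is ours, not the paper's formal definition: `sample` draws from a finite law given as a dictionary, and, in line with the interpretation of the inventory model, the horizontal walk on the boundary uses the law $\nu$ of $\theta$.

```python
import random

# One step of the walk (R_k) in the strip C_n = N_0 x {0, ..., c_n}, following
# the informal description above. mu0/mu1 are the step laws on N_0^2, nu is the
# step law of the horizontal walk on the boundary; p and q are the two coins.

def sample(law, rng):
    """Draw from a finite distribution given as {outcome: probability}."""
    u, acc = rng.random(), 0.0
    for outcome, weight in law.items():
        acc += weight
        if u < acc:
            return outcome
    return outcome

def step(r, c_n, p, q, mu0, mu1, nu, rng):
    s, t = r
    if t == c_n:                       # on the boundary: horizontal walk only
        return (s + sample(nu, rng), c_n)
    dx, dy = sample(mu1 if rng.random() < p else mu0, rng)
    if t + dy <= c_n:                  # attempted step stays inside the strip
        return (s + dx, t + dy)
    if rng.random() < q:               # second coin: clamp to the boundary
        return (s + dx, c_n)
    return (s, t)                      # stay put
```

Iterating `step` from $(0, 0)$ produces one trajectory of the walk; once the second coordinate hits $c_n$, it stays there forever, exactly as in the description above.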
On the other hand, if $R_{k-1} \in \partial C_n$, then $T_{k-1} = c_n$. In this case, the walk remains on the boundary and only its first coordinate changes. The interpretations of these assumptions are the following. (A1): The random variable $I_k$ models the coin toss that is used to decide whether the new or the old website design is displayed to the $k$th visitor of the website during the test phase. The random variable $J_k$ models the preference of the $k$th visitor. If $J_k = 1$, then the user must buy. Users with $J_k = 0$ only buy when they get exactly what they want in the first place.
(A2): The random variable (ξ k , η k ) can be interpreted as the vector of goods that the k th visitor would buy when visiting the (displayed version of the) website if there was enough supply of these goods.
(A3): The random variable θ k can be interpreted as the amount of type-1 goods that the k th visitor would buy when visiting the website and finding only goods of type 1 left. (A4): This is an independence assumption which is made to keep the model as simple as possible.
(A5): The random variable $(X_k, Y_k)$ models what is actually bought by the $k$th user. This depends on the needs of the user, $\xi_k$, $\eta_k$ and $\theta_k$, the remaining amount of the rare good 2 given by $c_n - T_{k-1}$, and the user's preference $J_k$. If there are enough goods available to meet the needs of the $k$th user, then the user will buy exactly the needed amounts, namely, $\xi_k$ of good 1 and $\eta_k$ of good 2. If good 2 is not available in sufficient quantity, then the user will either buy as much as possible of each of the goods if $J_k = 1$, or nothing at all if $J_k = 0$. If there is nothing left of good 2, the user will buy $\theta_k$ of good 1.

Notice that in both models, $\mathbb{P}$ depends on $p$, which is not explicit in the notation. While in large parts of the paper $p$ is fixed, in some places it is important to make the dependence of $\mathbb{P}$ on $p$ explicit. In these places, we write $\mathbb{P}_p$. Often, this will be $\mathbb{P}_0$ or $\mathbb{P}_1$, which correspond to the situations where only one version of the website is used. Let us introduce some notation for various characteristics of the above variables which we shall use throughout the paper.
Notice that m ξ and m η depend on p even though this is not explicit in the notation.
• The covariance matrices of the probability measures $\mu_i$, $i = 0, 1$, are denoted by $C_i$, $i = 0, 1$, respectively. The covariance matrix of the probability measure $\mu_p$ is defined analogously.
• We denote by $m_\theta = \mathbb{E}[\theta]$ and $\sigma_\theta^2 = \mathrm{Var}[\theta]$ the mean and the variance of the probability measure $\nu$.
Example 2.1 (Ranking experiment, take 2). We return to Example 1.1 and briefly explain how this example fits into the framework of the above model. The number of visitors during the test phase is $n = 4 \cdot 10^6$. The quantity of the attractive good 2 is $c_n = 1\,000 = \frac{1}{2}\sqrt{n}$. Website visitors view each version of the website with equal probability, so $I_1, I_2, \ldots$ have success probability $p = \frac{1}{2}$. Further, as can be readily seen from Figure 1,
$$\mu_0 = \tfrac{24}{25}\big(\tfrac{19}{20}\delta_{(0,0)} + \tfrac{1}{20}\delta_{(1,0)}\big) + \tfrac{1}{25}\big(\tfrac{1}{10}\delta_{(0,1)} + \tfrac{9}{10}\delta_{(0,0)}\big) = \tfrac{948}{1000}\delta_{(0,0)} + \tfrac{4}{1000}\delta_{(0,1)} + \tfrac{48}{1000}\delta_{(1,0)}.$$
Analogously, from Figure 2, we deduce
$$\mu_1 = \tfrac{91}{100}\delta_{(0,0)} + \tfrac{8}{100}\delta_{(0,1)} + \tfrac{1}{100}\delta_{(1,0)}.$$
Moreover, the law of $\theta$ is given by $\nu = \tfrac{19}{20}\delta_0 + \tfrac{1}{20}\delta_1$. The variables $J_k$ are irrelevant in the given situation as step sizes here are at most one; hence there will never be the situation where a visitor attempts to buy more of the popular good than what is left. We conclude that the theoretical conversion rates are given by $p_0 = \tfrac{52}{1000}$, $p_1 = \tfrac{9}{100}$, and $p_\theta = \tfrac{5}{100}$. We shall see that if $c_n = c\sqrt{n}$, then the chi-squared test will reject the hypothesis with probability tending to $1$ as $c$ becomes large. On the other hand, we shall demonstrate that Algorithm 1 does not perform better given the model assumptions (A1) through (A5).
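The measures and parameters of this example can be assembled and checked with exact arithmetic. A minimal sketch (variable names are ours; $\mu_1$ is computed from the decision tree of Example 1.1, where every visitor sees both goods):

```python
from fractions import Fraction as F

# The measures of Example 2.1 as dictionaries mapping a step (x, y) to its
# probability; variable names are ours.

mu0 = {(0, 0): F(948, 1000), (0, 1): F(4, 1000), (1, 0): F(48, 1000)}
mu1 = {(0, 0): F(91, 100), (0, 1): F(8, 100), (1, 0): F(1, 100)}
nu = {0: F(19, 20), 1: F(1, 20)}   # law of theta: type-1 purchases after sellout

p = F(1, 2)
steps = set(mu0) | set(mu1)
mu_p = {k: p * mu1.get(k, F(0)) + (1 - p) * mu0.get(k, F(0)) for k in steps}

p0 = sum(w for (x, y), w in mu0.items() if x + y > 0)  # conversion rate, Alg. 0
p1 = sum(w for (x, y), w in mu1.items() if x + y > 0)  # conversion rate, Alg. 1
m_eta = sum(F(y) * w for (x, y), w in mu_p.items())    # mean type-2 demand
p_theta = sum(w for x, w in nu.items() if x > 0)       # conversion after sellout
```

In particular, the mean type-2 demand per visitor under the mixture $\mu_p$ comes out as $m_\eta = \frac{42}{1000}$, a value used again in Example 3.8.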

Testing for the higher conversion rate
We address the question which algorithm, when used alone, leads to the higher conversion rate, where the conversion rate is the total number of sales divided by the total number of visitors. More formally, for $i = 0, 1$, we define
$$N_n^{(i)} := \sum_{k=1}^n \mathbb{1}_{\{I_k = i\}} \quad \text{and} \quad L_n^{(i)} := \sum_{k=1}^n Z_k \mathbb{1}_{\{I_k = i\}},$$
which model the number of visitors of website version $i$ and the number of those visitors who make a purchase. We set $L_n := L_n^{(0)} + L_n^{(1)}$, which is the total number of purchases, and notice that $C_n^{(i)} := L_n^{(i)}/N_n^{(i)}$ is the empirical conversion rate in group $i$. We stipulate that $C_n^{(i)}$ under $\mathbb{P}_p$ with $i = p$ is the empirical conversion rate when only website $i$ is used. In view of this, version 1 of the website is better than version 0 if $C_n^{(1)}$ under $\mathbb{P}_1$ is 'larger' than $C_n^{(0)}$ under $\mathbb{P}_0$. Here, the term 'larger' is not specified a priori, so we need to clarify what we mean by this.
Consequently, testing whether $C_n^{(1)}$ under $\mathbb{P}_1$ is larger than $C_n^{(0)}$ under $\mathbb{P}_0$ can be formulated as follows: test the hypothesis $p_0 = p_1$ against the alternative $p_0 \neq p_1$. Of more interest, in fact, would be the corresponding one-sided test problem. In either case, it is a classical test problem, and widely used tests for this problem are the chi-squared test, the G-test, and Fisher's exact test. To keep the paper short, we shall always restrict attention to the chi-squared test. For the reader's convenience, we recall some facts about this test. The test statistic for the chi-squared test is
$$\chi^2 = \frac{n \big(L_n^{(0)} N_n^{(1)} - L_n^{(1)} N_n^{(0)}\big)^2}{N_n^{(0)} N_n^{(1)} L_n (n - L_n)}. \qquad (3.1)$$
If (χ1) through (χ3) are in force, as $n \to \infty$, the distribution of $\chi^2$ approaches a chi-squared distribution with 1 degree of freedom [3, Chapter 17]. Write $q_{1-\alpha}$ for the $(1-\alpha)$-quantile of the chi-squared distribution with 1 degree of freedom. Then, with significance level $\alpha$, the hypothesis is rejected if $\chi^2 > q_{1-\alpha}$.
3.2. The limiting law of the chi-squared test statistic in the model incorporating low inventory. Now suppose that there is a rare but popular good, i.e., suppose that the model assumptions (A1) through (A5) hold. One goal of this paper is to point out in a quantitative way that when (A1) through (A5) instead of (χ1) through (χ3) are in force, the chi-squared test may produce too many false positives. In other words, it may fail to maintain the specified significance level. This is because the distribution of $\chi^2$ under the null hypothesis is different when (A1) through (A5) rather than (χ1) through (χ3) are in force. The detailed statement is given in the following theorem.
Theorem 3.1. Suppose that (A1) through (A5) and (2.1) are in force. Assume additionally that the limit $d_\infty := \lim_{n \to \infty} c_n/\sqrt{n}$ exists. Then the chi-squared statistic defined by (3.1) converges in distribution, and the limit law can be expressed in terms of $d_\infty$, the model parameters, and a standard normal random variable $\mathcal{N}$.
Hence, if one applies the chi-squared test with significance level $\alpha \in (0, 1)$ in the given situation, the test rejects the hypothesis with an (asymptotic) probability that differs from $\alpha$. In fact, this probability tends to 1 as $d_\infty \to \infty$. We specialize to the situation of Example 1.1. The resulting rejection probability becomes worse as $d_\infty$ becomes larger, see Figure 5 below. At first glance, this may appear to be no problem: as $p_1 > p_0$, one is tempted to guess that Algorithm 1 performs better than Algorithm 0, and that what we see above is just the power of the test, which improves as $d_\infty$ becomes large. However, we shall argue in Example 3.6 below that the two algorithms perform equally well when used separately.
Next, we show that in the general situation, assuming that (A1) through (A5) hold and that $\frac{c_n}{n} \to 0$, the asymptotic empirical conversion rates of the two versions of the website, when used separately, are identical on the linear scale. Hence, in the relevant regime ($\frac{c_n}{n} \to 0$), the first order of growth of $C_n^{(p)}$ depends only on what happens after the popular good is sold out. According to our model assumptions, the two versions of the website have identical performance once the popular good is sold out. This implies that on the linear scale, there is no difference between the two versions of the website. Hence, we need to make a comparison on a finer scale.
3.3. A joint limit theorem for the group sizes and numbers of purchases in each group. As $\chi^2$ is a function of $(N_n^{(0)}, N_n^{(1)}, L_n^{(0)}, L_n^{(1)})$, we continue with the asymptotic law of this vector, suitably shifted and scaled, in the most relevant scenario where $c_n$ is of the order $\sqrt{n}$.
Theorem 3.5. Suppose that (A1) through (A5) are in force and suppose, in addition to (2.1), that the limit $d_\infty := \lim_{n \to \infty} c_n/\sqrt{n}$ exists. Then, as $n \to \infty$, the group sizes and the numbers of purchases, suitably shifted and scaled, converge jointly to a centered Gaussian vector $(G_1, G_2, G_3)$. Notice that the theorem contains a limit theorem for the pure conversion rates $C_n^{(i)}$, where $m_\eta = p m_{\eta_1} + (1 - p) m_{\eta_0}$ with $p = 0$ and $p = 1$, respectively, has been used. Hence, if the two expectations in (3.5) and (3.6) coincide, then the performances of the two websites coincide asymptotically, both on the linear scale and on the level of fluctuations. The subsequent example demonstrates that this can be the case even if $p_0 \neq p_1$; that is, the two limits in (3.5) and (3.6) coincide in the given example even though $p_1 > p_0$.
From Theorem 3.5, we may immediately deduce a limit theorem for χ 2 , which is a preliminary version of Theorem 3.1.
Corollary 3.7. Suppose that (A1) through (A5) and (2.1) are in force and that the limit $d_\infty := \lim_{n \to \infty} c_n/\sqrt{n}$ exists. Then we have the following limit theorem for the chi-squared statistic defined by (3.1): $\chi^2$ converges in distribution to an explicit function of the Gaussian vector $(G_1, G_2, G_3)$ from Theorem 3.5, where the centering entering this function is the expectation of the right-hand side in (3.3).
We close this section with another example showing that ignoring the dependencies might also lead to a high false negative rate.
Example 3.8 (Ranking experiment with picky customers). We consider a variant of Example 1.1 in which there are picky customers who will only buy the rare good. This time, we use the former Algorithm 1 from above as the default algorithm, displaying the rare goods first. The former Algorithm 0 strategically keeps the rare inventory for the later arrival of picky customers by ranking the rare good low. In an experiment with shared inventory, Algorithm 1 sells off the rare good greedily, and the value of Algorithm 0's strategy will not be properly assessed in the model ignoring dependencies. We make this precise in the following.
Again suppose that during a test phase, n = 4 000 000 customers visit the website. Again, there are c n = 1 000 rare goods while good 1 is available at sufficient quantities.
The algorithms work as in Example 1.1, but there is a difference in the behavior of the customers. We assume that, independently of all other customers, each customer has a 1% chance of being picky. If not picky, the customer behaves like the customers of Example 1.1. A picky customer, however, will search as long as required to check whether there is something of the rare good left. If the rare good is still available, the picky customer buys one unit with 50% chance. Otherwise, the customer leaves the website. The overviews given in Figure 2 still apply to regular customers; for picky customers, and when good 2 is still available, there is a simplified decision tree. A calculation of the relevant model parameters in this example, based on the corresponding parameter values from Example 3.2, gives $p = \frac{1}{2}$ and $d_\infty = \frac{1}{2}$ as before, and
$$p_\theta = \tfrac{99}{100} \cdot \tfrac{1}{20}, \quad p_0 = \tfrac{99}{100} \cdot \tfrac{52}{1000} + \tfrac{1}{100} \cdot \tfrac{1}{2}, \quad p_1 = \tfrac{99}{100} \cdot \tfrac{9}{100} + \tfrac{1}{100} \cdot \tfrac{1}{2}, \quad m_\eta = \tfrac{99}{100} \cdot \tfrac{42}{1000} + \tfrac{1}{100} \cdot \tfrac{1}{2}.$$
In view of (3.5) and (3.6), Algorithm 0 does perform better than Algorithm 1. Now let us calculate the probability that the chi-squared test rejects the hypothesis that Algorithm 0 and Algorithm 1 perform equally well. According to Theorem 3.1, the chi-squared test (with significance level 5%) rejects the hypothesis with probability $0.1536348\ldots$. However, it is standard practice to say that Algorithm 0 is significantly better than Algorithm 1 only when $\chi^2 > q_{1-\alpha}$ and $C_n^{(0)} > C_n^{(1)}$ (with $\alpha \in (0, 1)$ being the significance level). So, asymptotically, the power of the test is $\lim_{n \to \infty} \mathbb{P}(C_n^{(0)} > C_n^{(1)}, \chi^2 > q_{1-\alpha})$. We shall now calculate this probability in the given situation with $d_\infty = 1/2$, but also as a function of $d_\infty$, to point out that the probability becomes arbitrarily small as $d_\infty$ becomes large. We begin by reformulating the condition $C_n^{(0)} > C_n^{(1)}$. By Theorem 3.5 and Slutsky's theorem, the two terms in the penultimate line tend to 0 in probability as $n \to \infty$.
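The parameter values displayed above can be reproduced with exact arithmetic. In the sketch below (variable names are ours), the expression for $m_\eta$ is completed under the assumption that a picky customer contributes $\frac{1}{100} \cdot \frac{1}{2}$ to the mean type-2 demand, matching the pattern of $p_0$ and $p_1$:

```python
from fractions import Fraction as F

# Parameter values of the picky-customer example, assembled from the
# regular-customer values of Example 1.1; the variable names are ours.

PICKY = F(1, 100)            # chance that a customer is picky
BUY_IF_AVAILABLE = F(1, 2)   # a picky customer buys the rare good w.p. 1/2

p_theta = (1 - PICKY) * F(1, 20)   # after sellout, picky customers buy nothing
p0 = (1 - PICKY) * F(52, 1000) + PICKY * BUY_IF_AVAILABLE
p1 = (1 - PICKY) * F(9, 100) + PICKY * BUY_IF_AVAILABLE
m_eta = (1 - PICKY) * F(42, 1000) + PICKY * BUY_IF_AVAILABLE
```

Note that $p_1 > p_0$ still holds while the rare good is in stock, which is exactly why the chi-squared test tends to favor the greedy Algorithm 1.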
By Theorem 3.5, the other summands converge in distribution, so that in the limit the above inequality becomes an inequality between Gaussian terms, which can be simplified. By Theorem 3.1, $\chi^2$ also converges in distribution. According to Corollary 3.7, we can express the condition $\chi^2 > q_{1-\alpha}$ in the limit as $n \to \infty$ in terms of $(G_1, G_2, G_3)$. Since the convergence in Theorem 3.5 holds jointly and since the law of $(G_1, G_2, G_3)$ on $\mathbb{R}^3$ is absolutely continuous with respect to Lebesgue measure, the Portmanteau theorem implies convergence of the corresponding probabilities. In the given situation (with $d_\infty = \frac{1}{2}$), we have used a Monte Carlo simulation (with $2 \cdot 10^6$ iterations) to estimate the power of the test, resulting in an estimate of $0.001933\ldots$. Further, it is already immediate that the power tends to 0 as $d_\infty$ tends to $\infty$ since $p_1 > p_0$. To get a better quantitative picture, we have computed this estimate as a function of $d_\infty$. In particular, for every value $d_\infty > 0$, the estimate of the power of the test is strictly smaller than $0.025 = \alpha/2$.

3.4. Functional limit theorem for the model incorporating low inventory. We deduce Theorem 3.5 from a more general result, namely, a joint functional central limit theorem. To formulate it, we need additional notation.
Henceforth, convergence in distribution of random elements in the Skorohod spaces D([0, ∞), R d ) and D((0, ∞), R d ) of R d -valued, right-continuous functions with existing left limits is with respect to the standard J 1 -topology and will be denoted by =⇒. To distinguish between convergence in the above two spaces we adopt the following convention. The convergence is in D([0, ∞), R d ) if the processes are written with subscript (·) t≥0 , whilst if the subscript is (·) t>0 , the convergence is in D((0, ∞), R d ).
Let $(B(t))_{t \ge 0} = ((B_1(t), \ldots, B_7(t)))_{t \ge 0}$ be a centered 7-dimensional Brownian motion with covariance matrix $\mathbf{V}$, that is, $B = \mathbf{V}^{1/2} W$, where $(W(t))_{t \ge 0}$ is a 7-dimensional Brownian motion with independent components, each being a one-dimensional standard Brownian motion, and $\mathbf{V}^{1/2}$ is the square root of the positive semi-definite matrix $\mathbf{V}$.
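For simulation purposes, $B = \mathbf{V}^{1/2} W$ can be realized on a time grid once the symmetric square root of $\mathbf{V}$ is available. The sketch below uses a toy $3 \times 3$ positive semi-definite matrix in place of the paper's $7 \times 7$ matrix $\mathbf{V}$:

```python
import numpy as np

# Realizing B = V^{1/2} W on a time grid. The 3x3 matrix V below is a toy
# positive semi-definite example, not the covariance matrix of the paper.

def sym_sqrt(V):
    """Symmetric square root of a positive semi-definite matrix."""
    evals, evecs = np.linalg.eigh(V)
    return evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.5]])
V = A @ A.T                            # any PSD matrix can be used here

root = sym_sqrt(V)

rng = np.random.default_rng(0)
dt = 1e-3
dW = rng.standard_normal((1000, 3)) * np.sqrt(dt)   # independent BM increments
B = np.cumsum(dW @ root.T, axis=0)                  # B(t) has covariance V * t
```

Any matrix $A$ with $A A^{\top} = \mathbf{V}$ (e.g., a Cholesky factor) would yield a process with the same law; the symmetric square root is simply the choice made in the paper's notation.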

Proofs
In the model, there are two natural breaks. First, the process evolves in the positive quadrant like an unrestricted two-dimensional random walk until the second coordinate of the walk for the first time attempts to step to or beyond the level $c_n$; we call this time $\tau_1(n)$. Then there is a number of attempts to reach that border until it is eventually reached at a time we call $\tau_2(n)$. From that time on, the walk keeps the second coordinate fixed at $c_n$ and evolves horizontally as a one-dimensional walk. Crucial for both the proof of the strong law of large numbers, Theorem 3.4, and the joint functional limit theorem, Theorem 3.9, is a sufficient understanding of these times $\tau_1(n)$ and $\tau_2(n)$.

4.1. Stopping time analysis and the strong law of large numbers. Formally, the two natural stopping times associated with the stochastic process $(R_k)_{k \in \mathbb{N}_0}$ are defined as follows:
• $\tau_1(n) := \inf\{k \in \mathbb{N} : T_{k-1} + \eta_k \ge c_n\}$, the first attempt to reach the half-plane $\mathbb{N}_0 \times [c_n, \infty)$, and
• $\tau_2(n) := \inf\{k \in \mathbb{N} : T_k = c_n\}$, the first entrance into the horizontal line $\mathbb{N}_0 \times \{c_n\}$.
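A quick way to build intuition for $\tau_1(n)$ and $\tau_2(n)$ is to simulate the second coordinate alone. The sketch below is our simplification: steps $\eta_k \in \{0, 1\}$, a clamping coin with success probability $q$ for blocked attempts, and toy parameter values throughout.

```python
import random

# Simulating tau_1(n) (first attempted crossing of the level c_n) and
# tau_2(n) (first entrance) for the second coordinate T of the walk.
# eta in {0, 1} is the attempted type-2 purchase; on a blocked attempt a
# second coin with success probability q decides whether the walk is
# clamped to the boundary (J = 1) or stays put (J = 0).

def stopping_times(c_n, p_eta=0.05, q=0.5, seed=1):
    rng = random.Random(seed)
    t, k, tau1 = 0, 0, None
    while t < c_n:
        k += 1
        eta = 1 if rng.random() < p_eta else 0
        if t + eta >= c_n:             # attempt to reach or cross the level
            if tau1 is None:
                tau1 = k               # first attempt: tau_1(n)
            if rng.random() < q:
                t = c_n                # clamped: boundary reached
        else:
            t += eta
    return tau1, k                     # (tau_1(n), tau_2(n))

tau1, tau2 = stopping_times(c_n=50)
```

Since each step increases $T$ by at most one, $\tau_2(n) \ge c_n$ always holds, and $\tau_1(n) \le \tau_2(n)$ by definition; the lemmas below quantify how close the two times are.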
We stipulate that $\tau_1(0) = \tau_2(0) = 0$. To simplify notation later on, we set $\tau_j(t) := \tau_j(\lfloor t \rfloor)$ for $t \ge 0$ and $j = 1, 2$. The above quantities are stopping times with respect to the natural filtration of $(R_k)_{k \in \mathbb{N}_0}$. By the strong law of large numbers, we have the asymptotics (4.1) for $\tau_1(n)$. Note that $\tau_1(n) \le \tau_2(n)$. We start by proving that $\tau_1(n)$ and $\tau_2(n)$ are uniformly close on the $\sqrt{c_n}$-scale.

Proof. By the regular variation of $(c_n)_{n \in \mathbb{N}}$, it is enough to prove the claim for $T = 1$. From (4.1) we conclude that for any $a > \frac{1}{m_\eta}$ there exists a random $n_0 \in \mathbb{N}$ such that the relevant supremum is bounded for all $n \ge n_0$. Fix $\varepsilon > 0$ and let us show that (4.7) holds for any $\delta \in (0, \varepsilon \mathbb{E}[S_1]) = (0, \varepsilon q m_\eta)$. To this end, fix an arbitrary such $\delta$ and bound the probability of the minimum over $m = 1, \ldots, \lfloor a c_n \rfloor$. For every $\lambda > 0$, Markov's inequality yields an exponential bound. It remains to note that $e^{\lambda \delta \varepsilon^{-1}} \mathbb{E}\big[e^{-\lambda \eta \mathbb{1}_{\{J=1\}}}\big] < 1$ for $\delta \in (0, \varepsilon q m_\eta)$ and sufficiently small $\lambda > 0$. The proof is complete.
For the proof of Theorem 3.4, we also need a counterpart of the above lemma for convergence in the almost sure sense.

Proof. In view of (4.2), it suffices to control, for every $\varepsilon > 0$, the event in (4.8). By the classical strong law of large numbers for $(S_k)_{k \in \mathbb{N}_0}$ and in view of (4.1),
$$\lim_{n \to \infty} \frac{S_{\tau_1(n) + \lfloor \varepsilon c_n \rfloor} - S_{\tau_1(n)}}{c_n} = \varepsilon q m_\eta > 0 \quad \text{and} \quad \lim_{n \to \infty} \frac{T_{\tau_1(n)}}{c_n} = 1 \quad \text{a.s.}$$
Therefore, with probability one, the event in (4.8) occurs only for finitely many n.
Proof of Theorem 3.4. Note that for $i = 0, 1$ and $t \ge 0$, the quantity of interest admits the decomposition (4.9) along the times $\tau_1(n)$ and $\tau_2(n)$. The following proposition is the key ingredient in the proof of Theorem 3.9.

Proposition 4.3. If the assumptions of Theorem 3.9 hold, then, as $n \to \infty$, a functional central limit theorem holds for the rescaled 7-dimensional step process, where $(B(t))_{t \ge 0}$ is a centered Brownian motion with covariance matrix $\mathbf{V}$ as in (3.8).
Proof. The convergence follows from Donsker's invariance principle, since $(X_n(t), Y_n(t))$ is the sum of independent identically distributed random vectors in $\mathbb{R}^7$ with finite second moments. The explicit form of the covariance matrix then results from elementary, yet cumbersome, calculations; for example, the entry in the first row and fourth column of $\mathbf{V}$ can be calculated directly.

In what follows, we shall frequently use the following two facts (see the Lemma on p. 151 in [1] and [4, Theorem 3.1]):
Fact 1: the addition mapping $+ : D([0, \infty), \mathbb{R}^d) \times D([0, \infty), \mathbb{R}^d) \to D([0, \infty), \mathbb{R}^d)$ is continuous with respect to the $J_1$-topology at all points $(f, g)$ such that both $f$ and $g$ are continuous;
Fact 2: the composition mapping $\circ : D([0, \infty), \mathbb{R}^d) \times D([0, \infty), [0, \infty)) \to D([0, \infty), \mathbb{R}^d)$, $(f, g) \mapsto f \circ g$, is continuous with respect to the $J_1$-topology at all points $(f, g)$ such that both $f$ and $g$ are continuous and $g$ is nondecreasing.
Proof of Theorem 3.9. From [5, Corollary 7.3.1], (2.1) and Fact 2, we infer the corresponding convergence. As in (4.9), for $i = 0, 1$, we can write the process as a sum of summands along the time intervals determined by $\tau_1$ and $\tau_2$. The summand corresponding to the interval between the two stopping times is bounded by $\tau_2(nt) - \tau_1(nt) + 1$, and thus the supremum over $t$ in a compact interval, divided by $\sqrt{c_n} = O(\sqrt{n})$, converges to zero in probability by Lemma 4.1 as $n \to \infty$. The behavior of the second and the third summands strongly depends on whether $\lim_{n \to \infty} \frac{c_n}{n}$ is smaller than, larger than, or equal to $\frac{1}{m_\eta}$. Given $0 < a < b$ and $i = 1, 2$, we introduce the corresponding restricted suprema. We first deal with the case $c_\infty \in (0, \frac{1}{m_\eta})$. In this case, we necessarily have $\rho = 1$. By Eq. (4.10), Lemma 4.1 and the uniform convergence theorem for regularly varying functions, we obtain the claimed convergence for $i = 1, 2$ and arbitrary $0 < a < b$. For typographical reasons, we shall write the convergences of the various components in separate formulas, keeping in mind that they actually converge jointly in view of (4.11). Firstly, using Fact 2 (continuity of the composition mapping), the continuous mapping theorem and (4.12), the first component converges as $n \to \infty$.

Replacing $\sqrt{c_n}$ in the denominators by $\sqrt{c_\infty n}$ and summing everything up, we get (4.17) as $n \to \infty$. It remains to note that (4.17) holds jointly with the remaining convergences, and together this is (3.9).
Case $c_\infty = 0$. In this case, (4.13) still holds but there are significant simplifications. First of all, note that in this case we can have $\rho \le 1$, and thus in (4.12) $\frac{1}{m_\eta} t$ must be replaced by $\frac{1}{m_\eta} t^{\rho}$. Further, in (4.15) and (4.16), upon replacing $\sqrt{c_n}$ in the denominator by $\sqrt{n}$, the limit becomes identically zero, and thus we have the same convergence (4.17) but with $c_\infty = 0$ on the right-hand side.
Finally, we deal with the case $c_\infty > \frac{1}{m_\eta}$. In this case, the corresponding limit follows from (4.12).

We now turn to the proof of Theorem 3.1. As a first step, we notice that Corollary 3.7 follows from Theorem 3.5 and the continuous mapping theorem. Theorem 3.5, in turn, follows immediately from Theorem 3.9. It thus remains to deduce Theorem 3.1 from Theorem 3.5.
Proof of Theorem 3.1. Our aim is to show how to calculate the distribution of the limiting variable from Corollary 3.7, where $G := (G_1, G_2, G_3)$ is a centered Gaussian vector with covariance matrix $\mathbf{V}_1$. A simple calculation shows that $p_\theta G_1 - p G_2 + (1 - p) G_3$ has a centered normal distribution whose variance can be computed from $\mathbf{V}_1$. Thus, for $\mathcal{N}$ having the standard normal distribution, the asserted representation of the limit law follows. The proof is complete.

Conclusions
Starting from the observation that the standard for testing product changes on e-commerce websites is large-scale hypothesis testing with statistical tests based on the assumption of independent samples, such as the chi-squared test, we have suggested a new model for the samples which incorporates shared inventories. This model introduces new dependencies. Our main result is the calculation of the asymptotic law of the chi-squared test statistic under the new model assumptions in the critical regime where the number of items of a popular good is of the order of the square root of the sample size. Website versions that greedily sell the popular good have an initial advantage in the number of sales of the order of the square root of the sample size, which is the order of the overall random fluctuations. Thus, the initial advantage has an impact on the probability of rejecting the hypothesis. We have demonstrated in examples that this may lead to both arbitrarily high false-positive and arbitrarily high false-negative rates. This calls into question the assumption, implicit in the industry standard, that dependencies are small enough to be ignored. Moreover, it suggests that the present standard of A/B testing favors algorithms that are designed to be good in competition against others, but not necessarily good when used on separate inventory.
Our work may be extended in the future in several directions. On the one hand, our results may be used to construct tests for the model that keep the significance level. On the other hand, the model may be extended to incorporate more features of real samples such as a priori information about website visitors etc.