On the use of comparison regions in visualizing stochastic uncertainty in some two‐parameter estimation problems

When considering simultaneous inference for two parameters, it is very common to visualize stochastic uncertainty by plotting two‐dimensional confidence regions. This allows us to test post hoc null hypotheses about a single point in a simple manner. However, in some applications the interest is not in rejecting hypotheses on single points, but in demonstrating evidence for the two parameters to be in a convex subset of the parameter space. The specific convex subset to be considered may vary from one post hoc analysis to another. It is then of interest to have a visualization that supports such analyses. We suggest comparison regions as a simple tool for this task.


INTRODUCTION
When considering simultaneous inference for a two-dimensional parameter vector $\theta = (\theta_1, \theta_2)'$, it is very common to visualize the point estimate and its stochastic uncertainty by plotting the point estimate and a $(1-\alpha)$-confidence region in a two-dimensional plane. This allows a simple judgement whether, for any given value $\theta_0$, the null hypothesis $H_0: \theta = \theta_0$ can be rejected at level $\alpha$ or not, just by inspecting the visualization and checking whether $\theta_0$ is outside or within the confidence region. This is particularly helpful in post hoc testing, that is, if there is an interest in testing a specific hypothesis generated after publication of the original analysis.
However, in some applications, it can be expected that the interest is not in rejecting certain values of $\theta$, but in demonstrating evidence for $\theta$ to be in a convex subset of the two-dimensional parameter space. A few examples should illustrate this point.
(i) Two-dimensional equivalence problems
In establishing new measurement or intervention procedures, it is often of interest to demonstrate that certain characteristics have not changed in comparison to established procedures. Such characteristics can often only be estimated, and equivalence is then established by demonstrating that the true difference is within a certain margin $\pm\delta$. If two characteristics are considered simultaneously, we can consider two margins $\delta_1$ and $\delta_2$. Then we may aim to demonstrate that the point corresponding to the two true differences $\theta_1$ and $\theta_2$ is within the corresponding rectangle, that is, a convex set. It has been argued that it might be preferable to consider alternative types of convex sets, allowing some type of compensation between $\theta_1$ and $\theta_2$. Consequently, besides rectangles, ellipses have also been considered as the regions of interest in bivariate equivalence problems, cf. Hoffelder, Goessl, and Wellek (2015).

(ii) Joint evaluation of sensitivity and specificity
In evaluating the sensitivity $se$ and the specificity $sp$ of a diagnostic test, it is often cumbersome to specify in advance the two necessary thresholds to perform one-sided hypothesis tests for each parameter: If the test performs much better than expected for one parameter, one may be satisfied with a lower value for the other parameter. To circumvent this problem, one may specify a weight $w$ and a threshold $\tau$ and aim at demonstrating $w\,se + (1-w)\,sp > \tau$, cf. Vach, Gerke, and Hoilund-Carlsen (2012). However, clinicians, patients, or policy makers may have different opinions on how to choose weights or thresholds. Consequently, it might be necessary to take several choices into account simultaneously, that is, to demonstrate $w_j\,se + (1-w_j)\,sp > \tau_j$ for $j = 1, \ldots, J$. Both with a single choice and with multiple choices, the aim is to demonstrate that the true values of sensitivity and specificity are within a convex subset of the parameter space.

(iii) Cost-effectiveness analyses
In cost-effectiveness analyses, we compare the incremental average costs $\Delta(c)$ of a new intervention with the gain in effectiveness $\Delta(e)$ with respect to some patient relevant outcome, based on estimates of both. A crucial quantity is the cost-effectiveness ratio $\Delta(c)/\Delta(e)$. Given a specific value $\lambda$ representing the acceptable ratio, we can consider the net benefit $b = \Delta(e) - \Delta(c)/\lambda$, expressing how much the effectiveness exceeds the necessary effectiveness $\Delta(c)/\lambda$ according to the actual costs and the acceptable ratio (Stinnett & Mullahy, 1998). The aim is to demonstrate $b \geq 0$, or at best $b \geq \tilde{b}$ for some $\tilde{b}$, that is, that the pair $(\Delta(c), \Delta(e))$ is in a given convex subset of the cost-effectiveness plane. Note that the net benefit approach is also applicable if $\Delta(c)$, $\Delta(e)$, or both are negative. A similar situation arises when there is a need to balance the benefit and the risk of medical interventions. Indeed, many approaches in benefit-risk assessment consider a benefit-risk plane similar to the cost-effectiveness plane (Guo et al., 2010; Mt-Isa et al., 2014). Gladstone and Vach (2015) suggested evaluating noninferiority trials by considering an advantage-deficit plane.
It is a first common property of the three examples that they involve half spaces or other convex sets as regions of interest. Here, a region of interest denotes a subset of the two-dimensional parameter space for which we are interested in demonstrating that it contains the true parameter values. The examples also share a second property: The concrete regions of interest may be defined after publication of the original analysis. Equivalence margins may change due to new regulatory guidelines, or a new type of application may require a stricter margin or may allow a more liberal margin. New stakeholders may have a different opinion on how to weight sensitivity and specificity and which threshold to use. Policy makers may change their opinion about the acceptable cost-effectiveness ratio. Hence, there is a need to visualize the two parameter estimates and their uncertainty in a way that allows us to check easily, for any given (convex) region of interest, whether we can reject the null hypothesis that $\theta$ is not in this region, that is, to generate evidence for the true parameter value to be covered by the region of interest. So we have to support this way of post hoc testing.
We start this paper by looking at the role of confidence intervals and confidence regions with respect to supporting post hoc testing. This way we will identify the need for a new type of region, which we call comparison regions. Then we will present a general construction principle for comparison regions based on one-sided tests for linear combinations of two parameters. We consider the special cases of using likelihood ratio (LR) tests or Wald tests, respectively, allowing to provide explicit representations of comparison regions and a comparison with the shape and size of confidence regions. We further present an illustrative example considering the case of two independent estimates of proportions and take a short look at finite sample properties. Our considerations will take into account that in some applications regions of interest are bounded convex sets, whereas in other applications regions of interest can be unbounded, but share the directions in which they are unbounded.

THE ROLE OF CONFIDENCE INTERVALS AND CONFIDENCE REGIONS IN POST HOC TESTING
In one-parameter problems, we are used to depicting the stochastic uncertainty of a parameter estimate $\hat{\theta}$ by presenting a $(1-\alpha)$-confidence interval and a p-value referring to a test for a prespecified null hypothesis. Most research questions require demonstrating that the true parameter value is above a certain threshold $\theta_0$, that is, the region of interest is given by $R = \{\theta \mid \theta > \theta_0\}$. This suggests using one-sided tests for the null hypothesis $H_0: \theta \leq \theta_0$. However, there is a tradition to prefer two-sided tests for the null hypothesis $H_0: \theta = \theta_0$, mainly to avoid the temptation to reverse the region of interest post hoc. Analogously, we prefer to present two-sided confidence intervals rather than one-sided confidence intervals. If we explicitly use one-sided tests, we typically choose the level $\alpha/2$ to avoid an advantage compared to two-sided testing.
[Figure caption fragment: The null hypotheses $H_0: \theta \notin R_1$ and $H_0: \theta \notin R_2$ can be rejected, and the null hypothesis $H_0: \theta \notin R_3$ cannot be rejected.]
There is a well-known correspondence between confidence intervals and hypothesis tests. If a family $(\phi_{\theta_0})_{\theta_0 \in \mathbb{R}}$ of level-$\alpha$ tests for $H_0: \theta = \theta_0$ is given, then $CI := \{\theta_0 \mid \phi_{\theta_0} = 0\}$ defines a $(1-\alpha)$-confidence interval. And if a $(1-\alpha)$-confidence interval $CI$ is given, $\phi_{\theta_0} := 1(\theta_0 \notin CI)$ defines a level-$\alpha$ test. Consequently, confidence intervals play a role not only in illustrating the imprecision of $\hat{\theta}$, but also in supporting post hoc testing: Whenever, after publication of a statistical analysis, there is an interest in testing the null hypothesis $H_0: \theta = \theta_0$, we can perform a corresponding hypothesis test by checking whether $CI$ does not cover $\theta_0$. This can also be extended to the situation where the region of interest $R$ is a half interval as considered above, if the two-sided $(1-\alpha)$-confidence interval $CI = [CI_{low}, CI_{up}]$ is constructed such that both $[CI_{low}, \infty]$ and $[-\infty, CI_{up}]$ are $(1-\alpha/2)$-confidence intervals. Then the rule $1(CI \subseteq R)$ corresponds to a one-sided level-$\alpha/2$ test of $H_0: \theta \notin R$. This rule, checking whether the confidence interval is completely within the region of interest, is probably also what many readers of scientific publications actually do when they have to interpret a confidence interval. Figure 1 illustrates this point.
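The inclusion rule for a one-parameter region of interest can be sketched in a few lines of Python. This is only an illustration: the function name `ci_supports_region` and the interval used in the usage lines are hypothetical and not taken from the paper.

```python
import math


def ci_supports_region(ci_low: float, ci_up: float,
                       lower: float = -math.inf,
                       upper: float = math.inf) -> bool:
    """Return True if the closed interval [ci_low, ci_up] lies inside the
    open region of interest (lower, upper).

    If the interval is a two-sided (1 - alpha) confidence interval built
    from two one-sided (1 - alpha/2) intervals, a True result corresponds
    to rejecting H0: theta not in R with a one-sided level alpha/2 test.
    """
    return lower < ci_low and ci_up < upper


# Hypothetical 95% confidence interval for a single parameter.
interval = (0.62, 0.81)
print(ci_supports_region(*interval, lower=0.60))  # region theta > 0.60
print(ci_supports_region(*interval, lower=0.65))  # region theta > 0.65
```

The first call supports the region of interest, the second does not, because the lower confidence limit falls below the threshold 0.65.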
We are less used to two-parameter problems, as they occur rather rarely in biomedical research. In general, the same principles as in the one-parameter case should be applied: We are interested in depicting the stochastic uncertainty of the two-dimensional parameter estimates and the evidence we have against a prespecified null hypothesis, reflecting the research question of interest. The first can be approached by a two-dimensional confidence region, the latter by a p-value of a corresponding hypothesis test. We still have a correspondence between confidence regions and hypothesis tests, as long as we consider tests for a single point null hypothesis. So, confidence regions also support the post hoc testing of such null hypotheses.
However, as pointed out above, in two-parameter problems the regions of interest $R$ are often half spaces or convex sets, such that corresponding hypothesis tests for $H_0: \theta \notin R$ are not testing single point null hypotheses. To support post hoc testing corresponding to such regions of interest, we are hence interested in alternative data-dependent regions $C$, such that $1(C \subseteq R)$ defines a level-$\alpha$ test for $H_0: \theta \notin R$ for any convex set $R$. We will call such a region a level-$\alpha$ comparison region, and Figure 2 illustrates the use of such a comparison region in post hoc testing. Such comparison regions do exist. As already pointed out by Munk and Pflüger (1999), also referring to previous work by Aitchison (1964) and Lehmann (1959), any $(1-\alpha)$-confidence region $C^*$ also provides a level-$\alpha$ comparison region: If $\theta_0 \notin R$, then $C^* \subseteq R$ implies $\theta_0 \notin C^*$, and hence $P_{\theta_0}(C^* \subseteq R) \leq P_{\theta_0}(\theta_0 \notin C^*) \leq \alpha$ by definition of a confidence region. However, the comparison of $R$ with a $(1-\alpha)$-confidence region is a very conservative approach. We will demonstrate in this paper that it is possible to construct level-$\alpha$ comparison regions that are distinctly smaller than $(1-\alpha)$-confidence regions, and hence provide a more powerful approach to post hoc testing. Consequently, if we want to support post hoc testing with respect to convex regions of interest, we should present not only confidence regions, but also comparison regions.

A GENERAL CONSTRUCTION PRINCIPLE
If we have for any half space $H$ of $\mathbb{R}^2$ a test $\phi_H$ for $H_0: \theta \in H$, then a comparison region $C$ has to fulfil $C \subseteq \bar{H}$ whenever $\phi_H = 1$, that is, whenever we can reject $H_0: \theta \in H$. Consequently, a straightforward idea is to define $C$ as the intersection of the complements of all half spaces that we can reject when considered as null hypotheses. In Appendix A, we show that this idea indeed works under some regularity conditions on the family of tests considered. It is sufficient to consider half spaces, as any convex set can be represented as an intersection of half spaces, and the intersection-union principle allows us to define a test based on testing all half spaces.
FIGURE 3 Visualization of the role of $t$ when using the parametrization $H_{t,r} := \{(\theta_1, \theta_2)' \mid \theta_1 \cos t + \theta_2 \sin t \geq r\}$ with $t \in [0, 2\pi)$ and $r \in \mathbb{R}$
However, performing the above construction principle explicitly for a given family of tests is rather cumbersome. Fortunately, our main result also provides a simple sufficient condition for comparison regions. This way we can easily check whether a given definition of a convex set defines indeed a comparison region.
To present this main result (also derived in Appendix A), we need to introduce the concept of the direction of a half space. Any half space of $\mathbb{R}^2$ can be expressed as $H_{t,r} := \{(\theta_1, \theta_2)' \mid \theta_1 \cos t + \theta_2 \sin t \geq r\}$ with $t \in [0, 2\pi)$ and $r \in \mathbb{R}$. The parameter $t$ represents the direction of the half space, as it corresponds to the first polar coordinate of all points on a line orthogonal to the boundary line of the half space and hitting the origin (cf. Figure 3). For any direction $t$, it is reasonable to assume that $H_{t,r}$ can be rejected if $r$ is large enough, and by regularity conditions we can ensure that there is a minimal $r$ for which $H_{t,r}$ can be rejected. We can expect that this half space plays a role in characterizing a comparison region. Indeed, it will coincide with the outer tangent half space of $C$ in direction $t$. This outer tangent half space is the half space with direction $t$ whose boundary line is a tangent of $C$ and which is disjoint from $C$.
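The parametrization of half spaces by a direction and an offset can be made concrete with a small Python sketch. The helper names `in_half_space` and `half_space_from_normal` are hypothetical; the second one merely converts an inequality with an arbitrary normal vector into the normalized form with $\cos t$ and $\sin t$ used above.

```python
import math


def in_half_space(theta, t, r):
    """Membership in H_{t,r} = {theta : theta1*cos(t) + theta2*sin(t) >= r}."""
    return theta[0] * math.cos(t) + theta[1] * math.sin(t) >= r


def half_space_from_normal(a1, a2, bound):
    """Rewrite {theta : a1*theta1 + a2*theta2 >= bound} as a pair (t, r).

    The normal vector (a1, a2) is scaled to unit length, so that t is its
    polar angle in [0, 2*pi) and r is the rescaled bound.
    """
    norm = math.hypot(a1, a2)
    t = math.atan2(a2, a1) % (2 * math.pi)
    return t, bound / norm


# Null hypothesis "0.5*theta1 + 0.5*theta2 <= 0.9", written with ">=" by
# multiplying both sides by -1.
t, r = half_space_from_normal(-0.5, -0.5, -0.9)
print(t, r)
print(in_half_space((0.85, 0.90), t, r))  # average 0.875, inside the null
print(in_half_space((0.95, 0.90), t, r))  # average 0.925, outside the null
```

Writing the null hypothesis as a half space with a "greater or equal" inequality is a design choice that keeps a single parametrization for all directions.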
Our main result takes into account that in many applications not all possible directions $t \in [0, 2\pi)$ are of relevance. If all regions of interest are unbounded in some direction, the comparison region is allowed to be unbounded in this direction, too, and there is no outer tangent half space in this direction. We can now present our main result as a theorem after introducing two regularity conditions: The family of tests should be monotone and it should be right-continuous, that is, if for any sequence (
Theorem 3.1. Let $(\phi_H)_{H \in \mathcal{H}}$ be a family of level-$\alpha$ tests for $H_0: \theta \in H$, satisfying conditions (1) and (2). Then any open, convex set C with ⊇ and defines a level-$\alpha$ comparison region for all R with ⊆ .
It should be emphasized that the theorem specifies a sufficient, but not a necessary condition. In particular, any superset of a comparison region is by definition a comparison region, too. We will make use of this fact in the sequel, where we first construct bounded comparison regions. In a second step, we enlarge them to unbounded comparison regions, restricting boundedness to the directions of interest.

LR TEST AND WALD TEST BASED COMPARISON REGIONS
In many two-parameter estimation problems, we can define hypothesis tests for any half space using the LR test principle or the Wald test principle. Hence, it is of interest to know how comparison regions based on these principles look like. This is considered in detail in Appendices B and C. We present here only the main results as two lemmata.

LR test-based comparison regions
Lemma 4.1. Let $l(\theta)$ be a continuous and strictly concave log-likelihood function of a two-parameter problem and $l^*$ the value at the maximum likelihood estimator. Then the contour set is a level-$\alpha$ comparison region, for any choice of $T$ and $\alpha < 0.5$.
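A membership check for a likelihood contour set can be sketched as follows. Two assumptions are made here and are not confirmed by the text at this point: the cutoff is taken as $z_{1-\alpha}^2 = \chi^2_{1,1-2\alpha}$, which is consistent with the correspondence between 5% comparison regions and 74.2% confidence regions stated elsewhere in the paper, and the log-likelihood is that of two independent binomial proportions. The counts in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist


def binom_loglik(x, n, p):
    """Binomial log-likelihood without the constant term, for 0 < p < 1."""
    return x * math.log(p) + (n - x) * math.log(1 - p)


def loglik(theta, x1, n1, x0, n0):
    """Joint log-likelihood of two independent proportions theta = (p1, p0)."""
    return binom_loglik(x1, n1, theta[0]) + binom_loglik(x0, n0, theta[1])


def in_lr_contour_set(theta, x1, n1, x0, n0, alpha=0.05):
    """True if 2 * (l* - l(theta)) < z_{1-alpha}^2 (assumed cutoff).

    Requires 0 < x1 < n1 and 0 < x0 < n0 so that the log-likelihood at
    the maximum likelihood estimate is finite in this simple form.
    """
    mle = (x1 / n1, x0 / n0)
    l_star = loglik(mle, x1, n1, x0, n0)
    cutoff = NormalDist().inv_cdf(1 - alpha) ** 2
    return 2 * (l_star - loglik(theta, x1, n1, x0, n0)) < cutoff


# Hypothetical counts: 106 of 112 and 198 of 216.
print(in_lr_contour_set((106 / 112, 198 / 216), 106, 112, 198, 216))
print(in_lr_contour_set((0.85, 198 / 216), 106, 112, 198, 216))
```

The maximum likelihood estimate itself is always inside the contour set, whereas a point with a first coordinate of 0.85 is far outside for these counts.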

Wald test-based comparison regions
The level-$\alpha$ Wald test for $H_0: \theta \in H_{t,r}$ is given by where $z_\gamma$ is the $\gamma$-percentile of the standard normal distribution. Theorem 3.1 allows us to derive the following result:

EXAMPLE: JOINT EVALUATION OF SENSITIVITY AND SPECIFICITY
In accuracy studies on a single diagnostic test, we observe $n_1$ diseased patients with $x_1$ having a positive test result, and $n_0$ disease-free patients with $x_0$ having a negative test result. The two estimates of interest are $\widehat{se} = x_1/n_1$ and $\widehat{sp} = x_0/n_0$, that is, two independent proportions. For the construction of comparison regions, we can consider the LR test based on
We illustrate the use of comparison regions in analyzing diagnostic accuracy using a multicenter study in 308 patients investigating the use of angiography-based quantitative flow ratio (QFR) measurements for online assessment of coronary stenosis (Xu et al., 2017). The study explores the sensitivity and specificity of QFR when using fractional flow reserve as reference standard. It results in estimates for sensitivity of 94.6% (95% CI [88.7, 98.0]) and for specificity of 91.7% (95% CI [87.1, 95.0]).
In Figure 4, 5% comparison regions and 95% confidence regions following both principles are shown for the data of this example. In this presentation, we take into account that in considering a weighted average of sensitivity and specificity, we are only interested in nonnegative weights. So all regions of interest represented by half spaces are unbounded for any $t \in (0, \pi/2)$, and this also holds for intersections of such regions. So when starting with the full comparison region presented in the previous section, we can enlarge the comparison region in these directions and arrive at the presentation given in Figure 4.
The figure illustrates nicely that in post hoc testing the use of comparison regions represents a more powerful approach than the use of confidence regions. For example, when we are interested in demonstrating that the average of sensitivity and specificity is above 0.9, we would be unable to reject the null hypothesis $0.5(se + sp) < 0.9$ when we use the comparison with a confidence region, but we would be able to reject this null hypothesis when we use the comparison with the comparison region.
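The comparison of a half space with Wald test-based regions can be sketched numerically. The sketch assumes elliptical regions of the form $(\theta - \hat{\theta})'\hat{\Sigma}^{-1}(\theta - \hat{\theta}) < q$, with $q = z_{1-\alpha}^2$ for the comparison region and $q = \chi^2_{2,1-\alpha}$ for the confidence region, as given in Appendix C. The counts 106/112 and 198/216 are an assumption: they are chosen only because they reproduce the reported percentages and are not taken from the source. The function name `lower_support` is hypothetical.

```python
import math
from statistics import NormalDist


def lower_support(est, var, weights, cutoff_sq):
    """Infimum of weights' * theta over the ellipse
    {theta : (theta - est)' Sigma^{-1} (theta - est) < cutoff_sq},
    where Sigma is diagonal with entries var (independent estimates)."""
    centre = sum(w * e for w, e in zip(weights, est))
    spread = math.sqrt(cutoff_sq * sum(w * w * v for w, v in zip(weights, var)))
    return centre - spread


# Assumed counts, chosen to match the reported 94.6% and 91.7%.
x1, n1, x0, n0 = 106, 112, 198, 216
se_hat, sp_hat = x1 / n1, x0 / n0
variances = (se_hat * (1 - se_hat) / n1, sp_hat * (1 - sp_hat) / n0)

alpha = 0.05
comparison_cut = NormalDist().inv_cdf(1 - alpha) ** 2  # z_{1-alpha}^2
confidence_cut = -2 * math.log(alpha)                  # chi^2_{2,1-alpha}

weights = (0.5, 0.5)  # plain average of sensitivity and specificity
comparison_lower = lower_support((se_hat, sp_hat), variances, weights, comparison_cut)
confidence_lower = lower_support((se_hat, sp_hat), variances, weights, confidence_cut)

# A region lies inside {0.5*(se + sp) > 0.9} if its lower support is at least 0.9.
print("comparison region inside:", comparison_lower >= 0.9)
print("confidence region inside:", confidence_lower >= 0.9)
```

Under these assumed counts, the comparison region lies within the half space while the confidence region does not, in line with the statement above. The closed form $-2\log\alpha$ is the $(1-\alpha)$-quantile of the chi-square distribution with two degrees of freedom.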

A SMALL INVESTIGATION OF FINITE SAMPLE PROPERTIES
We can observe in Figure 4 that LR test-based comparison regions are distinctly wider than Wald test-based comparison regions, mimicking the behavior of confidence regions. It is well known that Wald test-based confidence intervals for proportions behave poorly in small samples, in particular if the true probabilities are close to 0 or 1 (Agresti & Coull, 1998). We may expect such a behaviour also for Wald test-based comparison regions.
We investigate this question in a small investigation of finite sample properties. We consider a true sensitivity of 0.95 and a true specificity of 0.9, motivated by our example, and select the sample size according to our example, but also consider a doubled sample size. We evaluate the exact level of post hoc tests based on a comparison of the region of interest with the comparison region. As regions of interest, we consider five different half spaces and three convex sets built by intersection of half spaces. All share the property that the point corresponding to true sensitivity and true specificity is on the borderline of the region of interest. Two observations in Table 1 are noteworthy: First, if the regions of interest are half spaces, Wald test-based 5% comparison regions define a test with an actual level distinctly larger than the nominal level of 5%, in particular if the sample size is limited. This also holds for LR test-based 5% comparison regions, but to a distinctly lower degree, in particular if the borderlines of the half spaces are not parallel to the coordinate axes. Second, when the regions of interest are intersections of half spaces, comparison regions turn out to be rather conservative. This is not surprising, as our construction principle aimed at hitting with the tangents of the comparison regions exactly the boundary line of the maximal half space, which we can support as region of interest by hypothesis testing. For intersections of half spaces, we rely on the intersection-union principle, which is known to be conservative.
TABLE 1 Probability $P(C \subseteq R)$ for eight different regions of interest $R$ with $se = 0.95$, $sp = 0.9$, and the 5% comparison region $C$ based on the LR test or the Wald test, respectively. Sample sizes are based on the observed values in the study of Xu et al. (2017), taken single and double. The probability $P(C \subseteq R)$ is computed by complete enumeration over all possible values of $x_1$ and $x_0$. $C \subseteq R$ is assumed to hold if any of the 100 boundary points used in constructing the lower left quarter boundary line of the comparison region are covered by $R$.
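The complete enumeration over all possible counts can be sketched for a single half space and the Wald test-based comparison region. This sketch deviates from the procedure behind Table 1 in that it uses the closed-form support of the ellipse instead of 100 boundary points. The sample sizes 112 and 216 are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def binom_pmf(n, p):
    """Probabilities P(X = x) for X ~ Bin(n, p), x = 0, ..., n."""
    return [math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]


def wald_rejection_probability(n1, n0, se, sp, w=0.5, alpha=0.05):
    """Exact P(C subset of R) for the Wald-based comparison region C and
    R = {w*se' + (1-w)*sp' > w*se + (1-w)*sp}, so that the true point
    (se, sp) lies on the borderline of R."""
    z = NormalDist().inv_cdf(1 - alpha)
    threshold = w * se + (1 - w) * sp
    pmf1, pmf0 = binom_pmf(n1, se), binom_pmf(n0, sp)
    total = 0.0
    for x1, p1 in enumerate(pmf1):
        e1 = x1 / n1
        v1 = e1 * (1 - e1) / n1
        for x0, p0 in enumerate(pmf0):
            e0 = x0 / n0
            v0 = e0 * (1 - e0) / n0
            # Lower support of the ellipse in the direction of the weights.
            lower = w * e1 + (1 - w) * e0 - z * math.sqrt(w * w * v1 + (1 - w) ** 2 * v0)
            if lower > threshold:
                total += p1 * p0
    return total


# Assumed sample sizes; true values as in the investigation above.
level = wald_rejection_probability(112, 216, se=0.95, sp=0.90)
print(level)
```

The printed value is the actual level of the post hoc test for this particular half space and can be compared with the nominal 5%.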

RELATION TO PREVIOUS WORK
A first systematic investigation about interpreting confidence regions as comparison regions has been given by Aitchison (1964). In that paper, it was argued that the level of confidence regions has to be increased to allow an interpretation as a comparison region, as testing more than one single point hypothesis requires to adjust for multiple testing. We demonstrated in our paper that the opposite is true.
Our results can be seen as a generalization of the confidence interval inclusion principle in one-dimensional equivalence testing, cf. chapter 3.1 in Wellek (2010): If a $(1-2\alpha)$-confidence interval $CI$ for $\theta \in \mathbb{R}$ is given, then for any bounded interval $R$ the rule to reject $H_0: \theta \notin R$ if $CI \subseteq R$ defines a level-$\alpha$ test. Moreover, $CI$ can be expressed as the intersection of two $(1-\alpha)$-confidence intervals, cf. Schuirmann (1987), which corresponds to our general construction principle. Munk and Pflüger (1999) already considered a generalization of the interval inclusion principle to the higher dimensional case by demonstrating that in general $(1-2\alpha)$-confidence regions are $\alpha$-comparison regions under certain regularity conditions. However, to obtain a 5% comparison region in the two-dimensional case, their approach suggests using a 90% confidence region as comparison region, whereas our approach allows us to use the distinctly smaller 74.2% confidence region. Note that our approach can easily be generalized to the $k$-dimensional case.

DISCUSSION
In this paper, we have introduced the concept of comparison regions to support post hoc testing in two-parameter estimation problems. Comparison regions allow to check visually whether we have evidence for the true parameter values to be in a given convex subset of the parameter space. In several areas of statistical analyses, we are confronted with the task to prepare such post hoc testing in publishing the results of a study.
We have suggested a first construction principle for comparison regions, but we cannot claim any optimality of our approach. The question of optimality should be a topic of future research. Our results for LR test and Wald test based comparison regions also raise the question whether any convex 74.2% confidence region defines a 5% comparison region. We would like to point out one important difference to confidence regions: Although confidence regions keep their defining property under monotone transformations of the single parameters, this does not hold for comparison regions. This is a simple consequence of the fact that the convexity of subsets can already be destroyed by such transformations.
One simple difference between confidence regions and comparison regions is that the latter are just smaller. This result may be a little counterintuitive, as when considering a half space as a null hypothesis, we test many single point null hypotheses (cf. Aitchison, 1964). However, the property of being a convex set imposes a structure on this testing problem, such that it can be approached by a one-dimensional testing problem requiring only one degree of freedom. This stands in contrast to two-dimensional confidence regions, which are related to tests requiring two degrees of freedom.
It might also be argued that in two-parameter estimation problems there is some danger of reversing regions of interest post hoc. This would suggest using an $\alpha/2$-level comparison region, such that post hoc tests of half spaces are automatically performed at level $\alpha/2$. Indeed, hypothesis tests of half spaces are logically one-sided tests (cf. the structure of the Wald test considered in Section 4.2). Whether there is a risk of reversing regions of interest in two-parameter estimation problems is a question of debate, which can only be clarified in the context of a concrete application. Moreover, sometimes we can limit this risk by a careful choice of how to present a comparison region. For example, the presentation chosen in our example only supports checking regions of interest based on nonnegative weights and a lower threshold. If we consider an upper threshold, we can never reach $C \subseteq R$. In any case, it should be noted that $\alpha/2$-level comparison regions are still smaller than $(1-\alpha)$-confidence regions. For example, 2.5% Wald test and LR test based comparison regions coincide with 85% confidence regions.
As long as post hoc testing is performed with respect to half spaces, the use of comparison regions seems to provide an efficient approach. This is not the case for convex regions that can be represented by an intersection of half spaces with varying directions. In this situation, it might be desirable to explicitly perform a test taking into account the shape of the region of interest. However, this requires that sufficient information is given in the publication. An example of this type arises in accuracy studies, when a stakeholder is not willing to consider a weighted average of sensitivity and specificity, but requires a demonstration that both sensitivity and specificity are above certain thresholds. Then we can require that the corresponding tests are both significant at level $1 - \sqrt{1-\alpha}$. We can perform these tests if both numerator and denominator of the sensitivity and specificity estimates are given, which is typically the case. This approach is more powerful than using the comparison with the comparison region.
We would like to recommend the use of comparison regions in publications whenever it is expected that convex subsets of a two-dimensional parameter space have to be checked later for compatibility with the data. However, it can never be excluded that single point hypotheses also need to be checked later. Then it would be very dangerous if comparison regions were confused with confidence regions, as the actual coverage probability of an $\alpha$-level comparison region is much less than $1-\alpha$. We hence recommend presenting both confidence regions and comparison regions, and guiding the user in the correct application by using a dotted line for the boundary of confidence regions (as a reminder of the comparison with a single point) and a solid line for the boundary of comparison regions (as a reminder of the comparison with a convex set).
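The joint test of two thresholds at the adjusted level can be sketched as follows. The paper does not specify which test is used for each proportion; exact one-sided binomial tests are assumed here. The function names and the counts in the usage lines are hypothetical.

```python
import math


def adjusted_level(alpha):
    """Per-test level 1 - sqrt(1 - alpha) for two independent tests."""
    return 1 - math.sqrt(1 - alpha)


def upper_tail_p_value(x, n, p0):
    """Exact one-sided p-value P(X >= x) for X ~ Bin(n, p0), H0: p <= p0."""
    return sum(math.comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(x, n + 1))


def both_above(x1, n1, x0, n0, se_min, sp_min, alpha=0.05):
    """True if both proportions are shown to exceed their thresholds,
    each tested at the adjusted level."""
    level = adjusted_level(alpha)
    return (upper_tail_p_value(x1, n1, se_min) <= level
            and upper_tail_p_value(x0, n0, sp_min) <= level)


print(adjusted_level(0.05))
# Hypothetical counts and thresholds.
print(both_above(106, 112, 198, 216, se_min=0.85, sp_min=0.85))
```

Only the numerators and denominators of the two estimates are needed as input, which is why this check can be carried out from a typical publication.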

ACKNOWLEDGMENT
This work was supported by the German Research Foundation (DFG) [VA 88/5-1]. The article processing charge was funded by the University of Freiburg in the funding programme Open Access Publishing.

CONFLICT OF INTEREST
The authors declare that there is no conflict of interest.

SUPPORTING INFORMATION
Additional supporting information including source code to reproduce the results may be found online in the Supporting Information section at the end of the article.

APPENDIX A: DERIVATION OF THE MAIN RESULT
Any closed half space can be expressed as $H_{t,r} := \{(\theta_1, \theta_2)' \mid \theta_1 \cos t + \theta_2 \sin t \geq r\}$ with $t \in [0, 2\pi)$ and $r \in \mathbb{R}$. Furthermore, $(H_{t,r})_{t \in [0, 2\pi),\, r \in \mathbb{R}}$ provides a parametrization of the set $\mathcal{H}$ of all closed half spaces of $\mathbb{R}^2$. It is well known that any open convex subset of $\mathbb{R}^2$ can be expressed as the intersection of the complements of its tangent half spaces, that is,
We would like to note that all considerations presented so far can also be applied if the parameter space is restricted to a closed, convex subset $\Theta$ of $\mathbb{R}^2$. This only requires considering $\mathcal{H}_\Theta := \{H \cap \Theta \mid H \in \mathcal{H},\, H \cap \Theta \neq \emptyset\}$ instead of $\mathcal{H}$. As $\Theta$ is closed and regions of interest are open, the concept of tangent sets is still applicable in $\mathcal{H}_\Theta$.

APPENDIX C: WALD TEST-BASED COMPARISON REGIONS
If $\hat{\theta}$ is asymptotically Gaussian distributed with mean $\theta$ and variance $\Sigma$, and a consistent estimate $\hat{\Sigma}$ for $\Sigma$ is available, the level-$\alpha$ Wald test for $H_0: \theta \in H_{t,r}$ is given by where $z_\gamma$ is the $\gamma$-percentile of the standard normal distribution. The Wald test can also be used to construct $(1-\alpha)$-confidence regions for $\theta$. They are given by $C^* = \{\theta \mid (\theta - \hat{\theta})'\hat{\Sigma}^{-1}(\theta - \hat{\theta}) < \chi^2_{2,1-\alpha}\}$. Again, we can observe that comparison and confidence regions have similar shapes. As $z^2_{1-\alpha} = \chi^2_{1,1-2\alpha}$, we have again that a 5% comparison region corresponds to a 74.2% confidence region.
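The stated correspondence between comparison levels and confidence levels can be checked numerically. The sketch uses the fact that the chi-square distribution with two degrees of freedom has distribution function $1 - \exp(-x/2)$; the function name `equivalent_confidence_level` is hypothetical.

```python
import math
from statistics import NormalDist


def equivalent_confidence_level(alpha):
    """Confidence level of the two-parameter confidence region whose
    cutoff equals z_{1-alpha}^2, using F(x) = 1 - exp(-x / 2) for the
    chi-square distribution with 2 degrees of freedom."""
    z = NormalDist().inv_cdf(1 - alpha)
    return 1 - math.exp(-z * z / 2)


print(round(equivalent_confidence_level(0.05), 3))   # 5% comparison region
print(round(equivalent_confidence_level(0.025), 3))  # 2.5% comparison region
```

The two values agree with the 74.2% and roughly 85% confidence levels quoted in the paper.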