Modelling High-Dimensional Categorical Data Using Nonconvex Fusion Penalties

We propose a method for estimation in high-dimensional linear models with nominal categorical data. Our estimator, called SCOPE, fuses levels together by making their corresponding coefficients exactly equal. This is achieved using the minimax concave penalty on differences between the order statistics of the coefficients for a categorical variable, thereby clustering the coefficients. We provide an algorithm for exact and efficient computation of the global minimum of the resulting nonconvex objective in the case with a single variable with potentially many levels, and use this within a block coordinate descent procedure in the multivariate case. We show that an oracle least squares solution that exploits the unknown level fusions is a limit point of the coordinate descent with high probability, provided the true levels have a certain minimum separation; these conditions are known to be minimal in the univariate case. We demonstrate the favourable performance of SCOPE across a range of real and simulated datasets. An R package CatReg implementing SCOPE for linear models and also a version for logistic regression is available on CRAN.


Introduction
Categorical data arise in a number of application areas. For example, electronic health data typically contain records of diagnoses received by patients coded within controlled vocabularies and also prescriptions, both of which give rise to categorical variables with large numbers of levels [Jensen et al., 2012]. Vehicle insurance claim data also contain a large number of categorical variables detailing properties of the vehicles and parties involved [Hu et al., 2018]. When performing regression with such data as covariates, it is often helpful, both for improved predictive performance and interpretation of the fit, to fuse the levels of several categories together in the sense that the estimated coefficients corresponding to these levels have exactly the same value.
To fix ideas, consider the following ANOVA model relating response vector Y = (Y_1, . . . , Y_n)^T ∈ R^n to categorical predictors X_ij ∈ {1, . . . , K_j}, j = 1, . . . , p:

Y_i = µ_0 + Σ_{j=1}^p θ^0_{jX_ij} + ε_i,  i = 1, . . . , n.  (1)

Here the ε_i are independent zero-mean random errors, µ_0 is a global intercept and θ^0_{jk} is the contribution to the response of the kth level of the jth predictor; we will later place restrictions on the parameters to ensure they are identifiable. We are interested in the setting where the coefficients corresponding to any given predictor are clustered, so defining s_j := |{θ^0_{j1}, . . . , θ^0_{jK_j}}|, we have s_j ≪ K_j, at least when K_j is large. Note that our setup can include high-dimensional settings where p is large and many of the predictors do not contribute at all to the response: when s_j = 1, the contribution of the jth predictor is effectively null as it may be absorbed by the intercept term.
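As a concrete illustration, data from a model of this form can be simulated as follows. This is a sketch only: the level probabilities, coefficient values and sample sizes are illustrative, not taken from any experiment in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 10                      # observations; levels of a single predictor
# true coefficients clustered into s = 3 distinct values across K = 10 levels
theta0 = np.array([0.0] * 4 + [2.0] * 3 + [-1.5] * 3)
mu0 = 1.0
X = rng.integers(0, K, size=n)      # categorical predictor, levels coded 0..K-1
Y = mu0 + theta0[X] + rng.normal(0.0, 1.0, size=n)

s = len(np.unique(theta0))          # number of distinct coefficient values
print(s)                            # 3, far fewer than K = 10
```

Fusing levels here means estimating only s = 3 distinct coefficient values rather than K = 10.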

Background and motivation
Early work on collapsing levels together in low-dimensional models of the form (1) focused on performing a variety of significance tests for whether certain sets of parameters were equal [Tukey, 1949, Scott and Knott, 1974, Calinski and Corsten, 1985]. A more modern and algorithmic method based on these ideas is Delete or Merge Regressors, which involves agglomerative clustering based on t-statistics for differences between levels.
The CART algorithm for building decision trees effectively starts with all levels of the variables fused together and greedily selects which levels to split. One potential drawback of such greedy approaches is that in high-dimensional settings, where the search space is very large, they may fail to find good groupings of the levels. The popular random forest procedure uses randomisation to alleviate the issues with the greedy nature of the splits, but sacrifices interpretability of the fitted model.
An alternative to greedy approaches in high-dimensional settings is to use penalty-based methods such as the Lasso [Tibshirani, 1996]. These can be applied to continuous or binary data and involve optimising an objective for which global minimisation is computationally tractable, thereby avoiding some of the pitfalls of greedy optimisation. In contrast to random forests, the fitted models are sparse and interpretable. Inspired by the success of the Lasso and related methods for high-dimensional regression, a variety of approaches have proposed estimating θ^0 = (θ^0_{jk})_{j=1,...,p, k=1,...,K_j} and µ_0 by optimising over (µ, θ) the sum of a least squares criterion

ℓ(µ, θ) := (1/2n) Σ_{i=1}^n (Y_i − µ − Σ_{j=1}^p θ_{jX_ij})²  (3)

and a penalty of the form

Σ_{j=1}^p Σ_{k<l} w_{j,kl} |θ_{jk} − θ_{jl}|.  (4)

This is the CAS-ANOVA penalty. The weights w_{j,kl} can be chosen to balance the effects of having certain levels of categories more prevalent than others in the data. The penalty is an 'all-pairs' version of the fused Lasso and closely related to so-called convex clustering [Hocking et al., 2011, Chiquet et al., 2017]. We note that there are several other approaches besides using penalty functions; for instance, a Bayesian modelling procedure using sparsity-inducing prior distributions to encourage fusion of levels has been proposed. See also Tutz and Gertheiss [2016] and references therein for a review of other methods, including those based on mixture models and kernels. The fact that the optimisation problem resulting from (4) is convex makes the procedure attractive. However, a drawback is that it may not give a desirable form of shrinkage. Indeed, consider the case where p = 1 and, dropping subscripts for simplicity, all w_kl = 1; this would typically be the case if all levels were equally prevalent. Further suppose for simplicity that the number of levels K is even. Then if the coefficients are clustered into two groups where one contains only a single isolated coefficient, the number of non-zero summands in (4) is only K − 1.
This almost doubles to 2(K − 2) when one of the two groups is of size 2. The extreme case where the two groups are of equal size yields (K/2)² non-zero summands. This particular property of all-pairs penalties, which results in their favouring groups of unequal sizes, is illustrated schematically in Figure 1. We can see the impact of this in the following concrete example.

Figure 1: Illustration of the number of non-zero summands in (4) when p = 1, K = 16 and coefficients are clustered into two groups of equal size (right), and where one group contains a single coefficient (left) and two coefficients (middle).
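The counting argument above is easy to verify directly. The helper below is hypothetical (not part of any package) and counts the non-zero summands in (4) for a given partition of the K levels, assuming p = 1 and unit weights:

```python
def nonzero_pairs(groups):
    """Count pairs (k, l), k < l, with theta_k != theta_l when the K
    coefficients are partitioned into groups of the given sizes."""
    K = sum(groups)
    within = sum(g * (g - 1) // 2 for g in groups)  # pairs inside a group vanish
    return K * (K - 1) // 2 - within

K = 16
print(nonzero_pairs([1, K - 1]))        # K - 1 = 15: one isolated coefficient
print(nonzero_pairs([2, K - 2]))        # 2(K - 2) = 28: almost double
print(nonzero_pairs([K // 2, K // 2]))  # (K/2)^2 = 64: equal-sized groups
```

The jump from 15 to 64 summands as the groups balance is exactly the preference for unequal group sizes discussed above.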
If the coefficient estimates satisfy θ̂_1 = · · · = θ̂_4 < θ̂_5 = · · · = θ̂_10 ≤ θ̂_k for all k ≥ 11, so the first two groups have distinct coefficients, then moving any coefficient from the first group towards the second, and so increasing the number of estimated groups, actually decreases the penalty contribution in (4). Specifically, if the kth coefficient for some k ∈ {1, . . . , 4} moves to θ̂_k + t for t ∈ [0, θ̂_5 − θ̂_4] with all other coefficients kept fixed, the penalty contribution decreases by 13t. In this case then, CAS-ANOVA will struggle to keep the groups intact, especially smaller ones. We see this in Figure 2, which shows the result of applying CAS-ANOVA to data generated according to (1) with p = 1, θ^0 as above, n = 20 (so we have a single observation corresponding to each level), and ε_i i.i.d. ∼ N(0, 1). There is no value of the tuning parameter λ at which the true groups are recovered.
As in the standard regression setting, the bias introduced by all-pairs ℓ_1-type penalties may be reduced by choosing data-adaptive weights, analogously to the adaptive Lasso [Zou, 2006], or by replacing the absolute value |θ_jk − θ_jl| with ρ(|θ_jk − θ_jl|), where ρ is a concave and non-decreasing penalty function [Oelker et al., 2015, Ma and Huang, 2017]. However, this does not address the basic issue of a preference for groups of unequal sizes. Additionally, optimising an objective involving a penalty with O(Σ_{j=1}^p K_j²) summands can be computationally challenging, particularly when ρ is not convex, both in terms of runtime and memory.
To help motivate the new approach we propose in this paper, consider the setting where the predictors are ordinal rather than nominal, so there is an obvious ordering among the levels. Here it is natural to consider a fused Lasso [Tibshirani et al., 2005] penalty of the form

Σ_{j=1}^p Σ_{k=1}^{K_j − 1} |θ_{jπ_j(k+1)} − θ_{jπ_j(k)}|,

where π_j is a permutation of {1, . . . , K_j} specifying the given order; this is done in Gertheiss and Tutz [2010], who advocate using it in conjunction with the all-pairs-type CAS-ANOVA penalty for nominal categories. If however we treat the nominal variable setting as analogous to having ordinal variables with unknown orderings π_j, one might initially think of choosing π_j corresponding to the order of the estimates θ_j := (θ_jk)_{k=1}^{K_j}, such that θ_{jπ_j(k)} = θ_{j(k)}, where θ_{j(k)} is the kth smallest entry of θ_j. This however leads to what we refer to as the 'range' penalty:

Σ_{j=1}^p (θ_{j(K_j)} − θ_{j(1)}).  (6)

Whilst this shrinks the largest and smallest of the estimated coefficients together, the remaining coefficients lying in the open interval between these are unpenalised, so no grouping of the estimates is encouraged, as we observe in Figure 2; see also Oelker et al. [2015] for a discussion of this issue in the context of ordinal variables.

Our contributions and organisation of the paper
Given that all-pairs penalties have an intrinsic and undesirable preference for unequal group sizes, and that the fused Lasso applied to ordered coefficients (6) does not result in grouping of the coefficients, we propose the following solution. Our approach is to use the penalty

Σ_{j=1}^p Σ_{k=1}^{K_j − 1} ρ_j(θ_{j(k+1)} − θ_{j(k)}),

for concave (and nonconvex) non-decreasing penalty functions ρ_j, which, for computational reasons we discuss in Section 3, we base on the minimax concave penalty (MCP) [Zhang, 2010].
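In code, this penalty simply applies a concave function to the consecutive gaps of each block's sorted coefficients. The sketch below takes the ρ_j as arguments; the square-root penalty used in the demonstration is purely illustrative (a generic concave, non-decreasing function), whereas the paper's choice is the MCP:

```python
import numpy as np

def scope_penalty(theta_blocks, rhos):
    """Sum over predictors j of sum_k rho_j(theta_{j(k+1)} - theta_{j(k)}),
    where theta_{j(k)} denotes the kth order statistic of block j."""
    total = 0.0
    for theta_j, rho_j in zip(theta_blocks, rhos):
        gaps = np.diff(np.sort(theta_j))   # consecutive order-statistic gaps
        total += float(np.sum(rho_j(gaps)))
    return total

# illustrative concave, non-decreasing rho; the paper uses the MCP instead
rho = lambda x: np.sqrt(x)
theta = [np.array([1.0, 1.0, 5.0]), np.array([0.0, 4.0])]
print(scope_penalty(theta, [rho, rho]))   # sqrt(0) + sqrt(4) + sqrt(4) = 4.0
```

Note that fused coefficients contribute zero gaps, so only the K_j − 1 gaps (rather than all O(K_j²) pairs) are ever penalised.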
In Section 2 we formally introduce our method, which we call SCOPE, standing for Sparse Concave Ordering & Penalisation Estimator. Note that whereas in conventional high-dimensional regression the use of nonconvex penalties has been primarily motivated by the need to reduce bias in the estimation of large coefficients [Fan and Li, 2001], here the purpose is very different: in our setting a nonconvex penalty is in fact necessary for shrinkage to sparse solutions to occur (see Proposition 1). Because of these fundamental differences, the rich algorithmic and statistical theory concerning high-dimensional regression with nonconvex penalties (see for example Loh and Wainwright [2012, 2015], Wang et al. [2014], Fan et al. [2018], Zhao et al. [2018] and references therein) is not directly applicable to our setting.
In Section 3, we therefore introduce a new dynamic programming approach that recovers the global minimum of the resulting objective function exactly in the univariate case, i.e. when p = 1. We then build this into a blockwise coordinate descent approach to tackle the multivariate setting.
In Section 4 we study the theoretical properties of SCOPE and give sufficient conditions for the estimator to coincide with the least squares solution with oracular knowledge of the level fusions in the univariate case. These conditions involve a minimal separation between unequal coefficients that is, up to constant factors, minimax optimal. Our result contrasts sharply with Theorem 2 of Ma and Huang [2017] for an all-pairs nonconvex penalty. The latter instead shows the existence of a local optimum that coincides with the oracle least squares solution. Whilst in conventional high-dimensional regression settings it is known that, under certain conditions, all local optima have favourable properties [Loh and Wainwright, 2015], we note that the separation requirements in Ma and Huang [2017] are substantially weaker than those indicated by the minimax lower bound, and so their result cannot be extended to a particular local optimum determined by the data; see the discussion following Theorem 5.
We use our univariate result to show that the oracle least squares solution is a fixed point of our blockwise coordinate descent algorithm in the multivariate case. In Section 5 we outline some extensions of our methodology including a scheme for handling settings when there is a hierarchy among the categorical variables. Section 6 contains numerical experiments that demonstrate the favourable performance of our method compared to a range of competitors on both simulated and real data. We conclude with a discussion in Section 7. Further details of our algorithm can be found in the Appendix. The supplementary material contains additional information on the runtime of our algorithm, and an approximate version suitable for very large-scale settings, all the proofs, and additional information on the experiments in Section 6.

SCOPE methodology
Recall that our goal is to estimate the parameters (µ_0, θ^0) in model (1). Let us first consolidate some notation. For any θ ∈ R^{K_1} × · · · × R^{K_p}, we define θ_j := (θ_jk)_{k=1}^{K_j} ∈ R^{K_j}. We will study the univariate setting where p = 1 separately, and so it will be helpful to introduce simplified notation for this case, dropping any extraneous subscripts: we write K ≡ K_1, X_i ≡ X_{i1} and ρ ≡ ρ_1. Additionally, we let Ȳ_k denote the average of the Y_i with X_i = k:

Ȳ_k := (1/n_k) Σ_{i : X_i = k} Y_i,  where n_k := |{i : X_i = k}|.

In order to avoid an arbitrary choice of corner-point constraint, we instead impose the following to ensure that θ^0 is identifiable: for all j = 1, . . . , p we have

g_j(θ_j) := Σ_{k=1}^{K_j} n_jk θ_jk = 0,

where n_jk := |{i : X_ij = k}|. Let Θ_j = {θ_j ∈ R^{K_j} : g_j(θ_j) = 0}, and let Θ = Θ_1 × · · · × Θ_p. We will construct estimators by minimising over µ ∈ R and θ ∈ Θ an objective function of the form

Q(µ, θ) := ℓ(µ, θ) + Σ_{j=1}^p Σ_{k=1}^{K_j − 1} ρ_j(θ_{j(k+1)} − θ_{j(k)}),  (9)

where ℓ is the least squares loss function (3) and θ_{j(1)} ≤ · · · ≤ θ_{j(K_j)} are the order statistics of θ_j. We allow for different penalty functions ρ_j for each predictor in order to help balance the effects of varying numbers of levels K_j. The identifiability constraint that θ ∈ Θ ensures that the estimated intercept µ̂ := arg min_µ Q(µ, θ) satisfies µ̂ = Σ_{i=1}^n Y_i / n. We note that whilst the form of the identifiability constraint would not have a bearing on the fitted values of unregularised least squares regression, this is not necessarily the case when regularisation is imposed. For example, consider the simple univariate setting with p = 1 and the corner-point constraint θ_1 = 0. Then the fitted value for an observation with level 1 would simply be the average Ȳ_1, coinciding with that of unpenalised least squares. However, the fitted values for observations with any other level k ≥ 2 would be subject to regularisation and would in general differ from Ȳ_k. This inequitable treatment of the levels is clearly undesirable as they may have been labelled in an arbitrary way.
Our identifiability constraint treats the levels more symmetrically, but also takes into account the prevalence of levels, so the fitted values corresponding to more prevalent levels effectively undergo less regularisation.
As the estimated intercept µ̂ does not depend on the tuning parameters, we define Q(θ) := Q(µ̂, θ) and work with this objective in θ alone. We will take the regularisers ρ_j : [0, ∞) → [0, ∞) in (9) to be concave (and nonconvex); as discussed in the introduction and formalised in Proposition 1 below, a nonconvex penalty is necessary for fusion to occur.
We base the penalties ρ_j : [0, ∞) → [0, ∞) on the MCP [Zhang, 2010]:

ρ_{γ,λ}(x) := λ ∫_0^x (1 − u/(γλ))_+ du,

where (u)_+ = u 1_{u≥0}. This is a piecewise quadratic function with gradient λ at 0 that is flat beyond γλ. For computational reasons which we discuss in Section 3, this simple piecewise quadratic form is particularly helpful. In the multivariate case we take ρ_j = ρ_{γ,λ_j} with λ_j = λ K_j. This choice of scaling is motivated by requiring that when θ^0 = 0 we also have θ̂ = 0 with high probability; see Lemma 10 in the Supplementary material. We discuss the choice of the tuning parameters λ and γ in Section 3.3, but first turn to the problem of optimising (9).
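A direct implementation of the MCP is straightforward. The sketch below (a hypothetical helper, not the CatReg implementation) evaluates ρ_{γ,λ} in its closed piecewise form, λx − x²/(2γ) on [0, γλ] and γλ²/2 beyond, and illustrates the gradient-λ-at-zero and flat-beyond-γλ behaviour:

```python
import numpy as np

def mcp(x, lam, gamma):
    """Minimax concave penalty: integrating lam * (1 - u/(gamma*lam))_+ gives a
    quadratic with gradient lam at 0, constant gamma*lam^2/2 beyond gamma*lam."""
    x = np.asarray(x, dtype=float)
    return np.where(x < gamma * lam,
                    lam * x - x**2 / (2 * gamma),
                    gamma * lam**2 / 2)

lam, gamma = 1.0, 3.0
print(mcp(0.0, lam, gamma))          # 0.0 at the origin
print(mcp(10.0, lam, gamma))         # flat region: gamma*lam^2/2 = 1.5
print(mcp(1e-6, lam, gamma) / 1e-6)  # gradient at 0 is approximately lam = 1
```

The two quadratic pieces meet continuously at x = γλ, which is what makes the exact piecewise-quadratic bookkeeping of Section 3 possible.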

Computation
In this section we give details of how SCOPE is computed. Section 3.1 motivates and describes the dynamic programming algorithm we use to compute the global minimiser of the SCOPE objective, which is highly nonconvex. Section 3.2 explains how this is used to tackle the multivariate objective by embedding it within a blockwise coordinate descent routine. Practical considerations are discussed in Section 3.3.

Preliminaries
We now consider the univariate case (p = 1) and explain how the solutions are computed. In this case, we may rewrite the least squares loss contribution to the objective function as

ℓ(θ) = (1/2) Σ_{k=1}^K w_k (Ȳ_k − θ_k)² + c,

where w_k := n_k / n and c does not depend on θ. Thus the optimisation problem (9) can be written equivalently as

θ̂ ∈ arg min_{θ ∈ Θ} { (1/2) Σ_{k=1}^K w_k (Ȳ_k − θ_k)² + Σ_{k=1}^{K−1} ρ(θ_{(k+1)} − θ_{(k)}) },  (11)

suppressing the dependence of the MCP ρ on the tuning parameters γ and λ. In fact, it is straightforward to see that the constraint that the solution lies in Θ will be satisfied automatically, so we may replace Θ with R^K. Two challenging aspects of the optimisation problem above are the presence of the nonconvex ρ and of the order statistics. The latter, however, are easily dealt with using the result below, which holds more generally whenever ρ is a concave function.
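The collapse of the loss to level means is the usual ANOVA decomposition and can be checked numerically. In the sketch below (illustrative simulation, not from the paper), `const` is the part of the loss that does not depend on θ:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 50, 5
# ensure every level is observed at least once
X = np.concatenate([np.arange(K), rng.integers(0, K, n - K)])
Y = rng.normal(size=n)
theta = rng.normal(size=K)           # arbitrary candidate coefficients

ybar = np.array([Y[X == k].mean() for k in range(K)])  # level means
w = np.array([(X == k).mean() for k in range(K)])      # w_k = n_k / n

full = 0.5 * np.mean((Y - theta[X]) ** 2)       # (1/2n) sum_i (Y_i - theta_{X_i})^2
collapsed = 0.5 * np.sum(w * (ybar - theta) ** 2)
const = 0.5 * np.mean((Y - ybar[X]) ** 2)       # independent of theta
print(bool(np.isclose(full, collapsed + const)))  # True
```

Since `const` is fixed, minimising the full loss over θ is equivalent to minimising the K-term weighted sum of squares, which is what makes the univariate problem tractable.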
Proposition 2. Consider the univariate optimisation (11) with ρ any concave function such that a minimiser θ̂ exists. If for some k, l we have Ȳ_k > Ȳ_l, then θ̂_k ≥ θ̂_l.
This observation substantially simplifies the optimisation: after re-indexing such that Ȳ_1 ≤ Ȳ_2 ≤ · · · ≤ Ȳ_K, we may re-express (11) as

θ̂ ∈ arg min_{θ : θ_1 ≤ · · · ≤ θ_K} { (1/2) Σ_{k=1}^K w_k (Ȳ_k − θ_k)² + Σ_{k=1}^{K−1} ρ(θ_{k+1} − θ_k) }.

We use the following intermediate functions to structure the algorithm:

f_1(θ_1) := (1/2) w_1 (Ȳ_1 − θ_1)²,
f_k(θ_k) := min_{θ_{k−1} ≤ θ_k} { f_{k−1}(θ_{k−1}) + ρ(θ_k − θ_{k−1}) } + (1/2) w_k (Ȳ_k − θ_k)²,
b_k(θ_k) := sarg min_{θ_{k−1} ≤ θ_k} { f_{k−1}(θ_{k−1}) + ρ(θ_k − θ_{k−1}) },

for k = 2, . . . , K; here sarg min refers to the smallest minimiser in the case that it is not unique.
Invariably however this will be unique, as the following result indicates.
Proposition 3. The set of (Ȳ_k)_{k=1}^K for which (11) has more than one solution has Lebesgue measure zero as a subset of R^K.
We will thus tacitly assume uniqueness in some of the discussion that follows, though this is not required for our algorithm to return a global minimiser. Observe now that θ̂_K is the minimiser of the univariate objective function f_K: indeed, for k ≥ 2,

f_k(θ_k) = min_{θ_1 ≤ · · · ≤ θ_{k−1} ≤ θ_k} { (1/2) Σ_{l=1}^k w_l (Ȳ_l − θ_l)² + Σ_{l=1}^{k−1} ρ(θ_{l+1} − θ_l) }.

Furthermore, we have θ̂_{K−1} = b_K(θ̂_K), and more generally θ̂_k = b_{k+1}(θ̂_{k+1}) for k = K − 1, . . . , 1. Thus provided f_K can be minimised efficiently (which we shall see is indeed the case), given this and the functions b_2, . . . , b_K we can iteratively compute θ̂_K, θ̂_{K−1}, . . . , θ̂_1. In order to make use of these properties, we must be able to compute f_K and the b_k efficiently; we explain how to do this in the following subsection.
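Before turning to the exact piecewise-quadratic computations, the recursion can be illustrated with a crude grid discretisation. This is a sketch only: the function `scope_1d_grid` is hypothetical and approximates the exact algorithm of the following subsection by restricting each θ_k to a finite grid, with backpointer arrays playing the role of the b_k:

```python
import numpy as np

def mcp(x, lam=1.0, gamma=3.0):
    # MCP: quadratic up to gamma*lam, then constant
    return np.where(x < gamma * lam, lam * x - x**2 / (2 * gamma), gamma * lam**2 / 2)

def scope_1d_grid(ybar, w, lam=1.0, gamma=3.0, gridsize=200):
    """Discretised analogue of the recursion f_k(t) = min_{s <= t} {f_{k-1}(s)
    + rho(t - s)} + 0.5 * w_k * (ybar_k - t)^2. Assumes ybar sorted increasingly
    (Proposition 2)."""
    grid = np.linspace(ybar.min() - 1.0, ybar.max() + 1.0, gridsize)
    K = len(ybar)
    f = 0.5 * w[0] * (ybar[0] - grid) ** 2        # f_1 evaluated on the grid
    back = np.zeros((K, gridsize), dtype=int)     # backpointers ~ b_k
    for k in range(1, K):
        fk = np.empty(gridsize)
        for i, t in enumerate(grid):
            vals = f[: i + 1] + mcp(t - grid[: i + 1], lam, gamma)
            j = int(np.argmin(vals))
            back[k, i] = j
            fk[i] = vals[j] + 0.5 * w[k] * (ybar[k] - t) ** 2
        f = fk
    # backtrack: theta_K minimises f_K, then theta_{k-1} = b_k(theta_k)
    idx = int(np.argmin(f))
    theta = np.empty(K)
    for k in range(K - 1, -1, -1):
        theta[k] = grid[idx]
        if k > 0:
            idx = back[k, idx]
    return theta

ybar = np.array([0.0, 0.1, 0.2, 3.0, 3.1])   # sorted level means, two clusters
w = np.full(5, 0.2)                           # equal level weights
theta_hat = scope_1d_grid(ybar, w)
print(np.round(theta_hat, 2))
```

The exact algorithm replaces the grid with closed-form piecewise quadratic representations of the f_k, which is both faster and free of discretisation error.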
Computation of f_K and b_2, . . . , b_K

The simple piecewise quadratic form of the MCP-based penalty is crucial to our approach for computing f_K and the b_k. Some important consequences of this piecewise quadratic property are summarised in the following lemma.
Lemma 4. For each k, (i) f_k is continuous, coercive and piecewise quadratic with finitely many pieces; (ii) b_k is piecewise linear with finitely many pieces; (iii) any minimiser of θ_k ↦ f_k(θ_k) + ρ(θ_{k+1} − θ_k) over (−∞, θ_{k+1}] either equals θ_{k+1} or is a stationary point at which f_k is differentiable.

Properties (i) and (ii) above permit exact representation of f_k and b_k with finitely many quantities. The key task then is to form the collection of intervals and corresponding coefficients of quadratic functions for

g_k(θ_{k+1}) := min_{θ_k ≤ θ_{k+1}} { f_k(θ_k) + ρ(θ_{k+1} − θ_k) },

given a similar piecewise quadratic representation of f_k; and also the same for the linear functions composing b_k. A piecewise quadratic representation of f_{k+1} would then be straightforward to compute, and we can iterate this process. To take advantage of property (iii) above, in computing g_k(θ_{k+1}) we can separately search for minimisers at stationary points in (−∞, θ_{k+1}) and compare the corresponding function values with f_k(θ_{k+1}); the fact that we need only consider potential minimisers at points of differentiability will simplify things, as we shall see below.
Suppose I_{k,1}, . . . , I_{k,m(k)} are intervals that partition R (closed on the left) and q_{k,1}, . . . , q_{k,m(k)} are corresponding quadratic functions such that f_k(θ_k) = q_{k,r}(θ_k) for θ_k ∈ I_{k,r}. Let us write q̃_{k,r}(x) := q_{k,r}(x) for x ∈ I_{k,r} and q̃_{k,r}(x) := ∞ otherwise. We may then express f_k as f_k(θ_k) = min_r q̃_{k,r}(θ_k). We can also express the penalty ρ = ρ_{γ,λ} in a similar fashion. Let

ρ̃_1(x) := λx − x²/(2γ) for x ∈ [0, γλ) and ∞ otherwise;  ρ̃_2(x) := γλ²/2 for x ∈ [γλ, ∞) and ∞ otherwise.

Then ρ(x) = min_t ρ̃_t(x) for x ≥ 0. Let D_k be the set of points at which f_k is differentiable. We then have, using Lemma 4 (iii), that

g_k(θ_{k+1}) = min{ f_k(θ_{k+1}), min_{r,t} u_{k,r,t}(θ_{k+1}) },  where  u_{k,r,t}(θ_{k+1}) := m̃in_{θ_k ∈ D_k} { q̃_{k,r}(θ_k) + ρ̃_t(θ_{k+1} − θ_k) },  (17)

and m̃in denotes the minimum if it exists and ∞ otherwise. The fact that in the inner minimisation we are permitted to consider only points in D_k simplifies the form of u_{k,r,t}. We show in Section A.1 of the Appendix that u_{k,r,t} is finite only on an interval, where it takes the value of a quadratic function; the coefficients of this function and the interval endpoints have closed-form expressions that are elementary functions of the coefficients and intervals corresponding to q̃_{k,r}. With this, we have an explicit representation of g_k as the minimum of a collection of functions that are quadratic on intervals and ∞ everywhere else. Let us refer to these intervals (closed on the left) and corresponding quadratic functions as J_{k,1}, . . . , J_{k,n(k)} and p_{k,1}, . . . , p_{k,n(k)} respectively. In order to produce a representation of f_{k+1} for use in future iterations, we must express g_k as a collection of quadratics defined on disjoint intervals. To this end, define for each x ∈ R the active set at x, A(x) = {r : x ∈ J_{k,r}}. Note that the endpoints of the intervals J_{k,r} are the points where the active set changes, and it is thus straightforward to determine A(x) at each x. Let r(x) be the index such that g_k(x) = p_{k,r(x)}(x). For large negative values of x, A(x) will contain a single index, and for such x this must be r(x). Consider also, for each r ∈ A(x) \ {r(x)}, the horizontal coordinate x′ of the first intersection beyond x (if it exists) between p_{k,r} and p_{k,r(x)}.
We refer to the collection of all such tuples (x′, r) as the intersection set at x and denote it by N(x). Given r(x), N(x) can be computed easily. The intersection set N(x) then in turn helps to determine the smallest x′ > x at which r(x′) ≠ r(x), that is, the next knot of g_k beyond x, as we now explain. Suppose at a point x_old we have computed r_old = r(x_old). We set x_cur = x_old and perform the following.

1. Compute the intersection set N(x_cur), and let (x_int, r_int) be the element of N(x_cur) with the smallest horizontal coordinate x_int.
2. If there are no changes in the active set between x_cur and x_int, we have found the next knot point at x_int, with r(x_int) = r_int.
3. If instead the active set changes, move x_cur to the leftmost change point. We have that r(x) = r_old for x ∈ [x_old, x_cur). To determine whether r(x) changes at x_cur, we check whether (i) r_old leaves the active set at x_cur, so r_old ∉ A(x_cur), or (ii) some r_new enters the active set at x_cur and 'beats' r_old, so r_new ∈ A(x_cur) \ A(x_old) and p_{k,r_new}(x_cur + ε) < p_{k,r_old}(x_cur + ε) for ε > 0 sufficiently small.
If either holds, x_cur is a knot and r(x_cur) may be computed via r(x_cur) = arg min_{r ∈ A(x_cur)} p_{k,r}(x_cur).
If neither holds, we conclude that r(x_cur) = r_old and go to step 1 once more.
Hence we can proceed from one knot of g_k to the next by comparing the values and intersections of a small collection of quadratic functions, and thereby form a piecewise quadratic representation of g_k in a finite number of steps. Figure 3 illustrates the steps outlined above. The pieces of b_k may be computed in a similar fashion. We note that several modifications can speed up the algorithm: for example, for each r, u_{k,r,2} in (17) is a constant function where it is finite (see p_{k,2} in the figure), and these pieces can be dealt with more efficiently. For further details including pseudocode see Section A.2 of the Appendix.

Figure 3: Illustration of the optimisation problem and our algorithm, to be interpreted with reference to steps 1, 2, 3 in the main text. Shading indicates regions where the active set, displayed at the bottom of the plot, is invariant, and vertical dotted lines signify changes. Dotted curves correspond to parts of quadratic functions p_{k,l} lying outside their associated intervals J_{k,l}. At x_old, we have r(x_old) = 1, A(x_old) = {1, 2} and N(x_old) = {(x_int^(1), 2)}. Since the active set changes between x_old and x_int^(1), we move x_cur to the first change point P and see that neither (i) nor (ii) occurs. We therefore return to step 1 and compute N(x_cur), which additionally contains (x_int^(2), 3). As the active set is unchanged between x_cur and x_int^(2), we have determined the next knot point x_int^(2) and minimising quadratic p_{k,3}.
In summary, our algorithm produces a piecewise quadratic representation of f_K, which we can minimise efficiently to obtain θ̂_K. We also have piecewise linear representations of the functions b_2, . . . , b_K, through which we may iteratively obtain θ̂_k = b_{k+1}(θ̂_{k+1}) for k = K − 1, . . . , 1.
It seems challenging to obtain meaningful bounds on the number of computations that must be performed at each stage of this process in terms of parameters of the data. However, to give an indication of the scalability of this algorithm, we ran a simple example with 3 true levels and found that with 50 categories the runtime was under 10^{−3} seconds; with 2000 categories it was still well under half a second. More details on computation time can be found in Sections S1.3 and S3.2 of the Supplementary material. In Section S1.4 of the Supplementary material, we describe an approximate version of the algorithm that can be used for fast computation in very large-scale settings.

Multivariate model
Using our dynamic programming algorithm for the univariate problem, we can attempt to minimise the objective (9) for the multivariate problem using block coordinate descent. This has been shown empirically to be a successful strategy for minimising objectives in high-dimensional regression with nonconvex penalties such as the MCP [Mazumder et al., 2011, Breheny and Huang, 2015], and we take this approach here. In the multivariate case, we iteratively minimise the objective Q over θ_j := (θ_jk)_{k=1}^{K_j} ∈ Θ_j, keeping all other parameters fixed. For a given (γ, λ) and initial estimate θ̂^(0) ∈ Θ, we repeat the following until a suitable convergence criterion is met:

1. Initialise m = 1, and set, for i = 1, . . . , n, the residuals R_i = Y_i − µ̂ − Σ_{j=1}^p θ̂^(0)_{jX_ij}, to be updated throughout.

2. For j = 1, . . . , p, compute

θ̂_j^(m) ∈ arg min_{θ_j ∈ Θ_j} Q(θ̂_1^(m), . . . , θ̂_{j−1}^(m), θ_j, θ̂_{j+1}^(m−1), . . . , θ̂_p^(m−1)).  (19)

3. Increment m → m + 1.
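The residual bookkeeping in these updates can be sketched as follows. Here `block_cd` and `solve_1d` are hypothetical names (not from the CatReg package), and `solve_1d` is a placeholder for the exact penalised univariate solver; for illustration we plug in a trivial unpenalised solver, which reduces the scheme to ordinary backfitting:

```python
import numpy as np

def block_cd(Y, X, K, solve_1d, iters=10):
    """Blockwise coordinate descent sketch: cycle over the p categorical
    predictors, refitting each block against partial residuals. solve_1d(ybar, w)
    stands in for the exact penalised univariate solver."""
    n, p = X.shape
    mu = Y.mean()
    theta = [np.zeros(Kj) for Kj in K]
    for _ in range(iters):
        for j in range(p):
            others = sum(theta[l][X[:, l]] for l in range(p) if l != j)
            R = Y - mu - others                      # partial residuals
            ybar = np.array([R[X[:, j] == k].mean() for k in range(K[j])])
            w = np.array([(X[:, j] == k).mean() for k in range(K[j])])
            fit = solve_1d(ybar, w)
            theta[j] = fit - np.sum(w * fit)         # enforce sum_k n_jk theta_jk = 0
    return mu, theta

rng = np.random.default_rng(2)
n, p, Kj = 300, 2, 4
# ensure every level appears at least once
X = np.column_stack([np.concatenate([np.arange(Kj), rng.integers(0, Kj, n - Kj)])
                     for _ in range(p)])
theta0 = [np.array([1.0, 1.0, -1.0, -1.0]), np.array([0.0, 0.0, 2.0, 2.0])]
Y = 0.5 + sum(t[X[:, j]] for j, t in enumerate(theta0)) + rng.normal(0.0, 0.1, n)
mu, theta = block_cd(Y, X, [Kj, Kj], lambda ybar, w: ybar)
print(np.round(theta[0], 1))
```

Note that the weighted centering step is the exact projection onto the identifiability constraint set Θ_j, so each block update remains feasible.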
We define a blockwise optimum of Q to be any θ̂ ∈ Θ such that, for each j = 1, . . . , p,

θ̂_j ∈ arg min_{θ_j ∈ Θ_j} Q(θ̂_1, . . . , θ̂_{j−1}, θ_j, θ̂_{j+1}, . . . , θ̂_p).  (20)

This is equivalent to θ̂ being a fixed point of the block coordinate descent algorithm above. Provided γ > 0, Q is continuous in θ. If, in addition, the minimisers in (19) are unique for all j and m (which will invariably be the case when the responses are realisations of continuous random variables; see Proposition 3), then all limit points of the sequence (θ̂^(m))_{m=0}^∞ are blockwise optima.

Practicalities
In practice, the block coordinate descent procedure described above must be performed over a grid of (γ, λ) values to facilitate tuning parameter selection by cross-validation. In line with analogous recommendations for other penalised regression optimisation procedures [Breheny and Huang, 2011, Friedman et al., 2010], we propose, for each fixed γ, to iteratively obtain solutions for an exponentially decreasing sequence of λ values, warm-starting each application of block coordinate descent at the solution for the previous λ. In our experience this scheme speeds up convergence and helps to guide the resulting estimates to statistically favourable local optima, as has been shown theoretically in certain nonconvex settings [Wang et al., 2014]. The grid of γ values can be chosen to be fairly coarse as the solutions appear to be less sensitive to this tuning parameter; in fact fixing γ ∈ {8, 32} yields competitive performance across a range of settings (see Section 6). The choice γ ↓ 0, which mimics the ℓ_0 penalty, has good statistical properties (see Theorem 5 and the following discussion). However, the global optimum then typically has a smaller basin of attraction and can be prohibitively hard to locate, particularly in low signal-to-noise ratio settings, where larger γ tends to dominate.
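A minimal sketch of the λ-grid construction follows; the endpoints, ratio and grid length here are illustrative defaults of a hypothetical helper, not the choices made in the CatReg package:

```python
import numpy as np

def lambda_path(lam_max, n_lambda=50, ratio=1e-3):
    """Exponentially decreasing sequence of lambda values for warm starts,
    from lam_max down to ratio * lam_max (a common default scheme)."""
    return np.geomspace(lam_max, ratio * lam_max, num=n_lambda)

path = lambda_path(2.0)
print(len(path), path[0], round(path[-1], 6))  # 50 values from 2.0 down to 0.002
```

Each block coordinate descent run would then be initialised at the solution obtained for the preceding (larger) λ in `path`.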

Theory
In this section, we study the theoretical properties of SCOPE. Recall our model: for i = 1, . . . , n,

Y_i = µ_0 + Σ_{j=1}^p θ^0_{jX_ij} + ε_i,  (21)

where θ^0 ∈ Θ. We will assume the errors (ε_i)_{i=1}^n have mean zero and are independent and sub-Gaussian with parameter σ. Let Θ_0 = {θ ∈ Θ : θ_jk = θ_jl whenever θ^0_jk = θ^0_jl, for all j} and define the oracle least squares estimate

θ̂^0 ∈ arg min_{θ ∈ Θ_0} Σ_{i=1}^n (Y_i − µ̂ − Σ_{j=1}^p θ_{jX_ij})².  (22)

This is the least squares estimate of θ^0 with oracular knowledge of which categorical levels are fused in θ^0. Note that in the case where the errors have equal variance v², the expected mean squared prediction error of θ̂^0 satisfies

(1/n) Σ_{i=1}^n E{ (Σ_j θ̂^0_{jX_ij} − Σ_j θ^0_{jX_ij})² } ≤ v² Σ_{j=1}^p (s_j − 1) / n,

with equality when θ̂^0 is unique.
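In the univariate case the oracle estimator has a simple closed form: with the true fusions known, each fused group's coefficient is a mean of the (intercept-adjusted) responses over the corresponding levels. The sketch below uses a hypothetical helper `oracle_ls` and an illustrative simulation, not data from the paper:

```python
import numpy as np

def oracle_ls(Y, X, groups):
    """Oracle least squares (univariate sketch): `groups` maps level -> group id
    encoding the true fusions; each group's coefficient is the mean of the
    centred responses over that group's levels."""
    mu = Y.mean()
    G = np.array([groups[x] for x in X])
    s = G.max() + 1
    coef = np.array([(Y - mu)[G == g].mean() for g in range(s)])
    return mu, coef[G]            # fitted theta for each observation

rng = np.random.default_rng(3)
n, K = 120, 6
groups = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}   # true fusions: s = 3 groups
theta0 = np.array([1.0, 1.0, -1.0, -1.0, 0.0, 0.0])
X = np.concatenate([np.arange(K), rng.integers(0, K, n - K)])
Y = theta0[X] + rng.normal(0.0, 0.3, n)
mu, fitted = oracle_ls(Y, X, groups)
print(np.round(np.sort(np.unique(np.round(fitted, 8))), 2))
```

Only s = 3 distinct fitted values appear, so the oracle fit has far fewer free parameters than the K = 6 level means.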
Our results below establish conditions under which θ̂^0 is a blockwise optimum (20) of the SCOPE objective function Q (9), or, in the univariate case, conditions under which it in fact coincides with SCOPE. The minimum differences between the signals, defined for each j by

∆(θ^0_j) := min{ |θ^0_jk − θ^0_jl| : θ^0_jk ≠ θ^0_jl },

will play a key role. If all components of θ^0_j are equal, we take ∆(θ^0_j) to be ∞. We also introduce n_{j,min} = min_k n_jk, together with n^0_{j,min} and n^0_{j,max}; these latter two quantities are, respectively, the minimum and maximum number of observations corresponding to a set of fused levels in the jth predictor.

Univariate model
We first consider the univariate case, where as usual we will drop the subscript j for simplicity.
The following result establishes conditions for recovery of the oracle least squares estimate (22).
Theorem 5. For a choice of the tuning parameters (γ, λ) with γ ≤ ηs and λ such that equality holds in (24), we have, writing ∆ ≡ ∆(θ^0), that θ̂ = θ̂^0 with high probability; the explicit probability bound involves an absolute constant c. The quantity η ∈ (0, 1] reflects how equal the numbers of observations in the true fused levels are (analogously to the condition in Theorem 6 below): in settings where the prevalences of the underlying true groups are roughly equal, we would expect it to be close to 1. Consider now an asymptotic regime where K, s and 1/∆ are allowed to diverge with n, n_min ≍ n/K, so all levels have roughly the same prevalence, and η is bounded away from zero, so all true underlying groups also have roughly the same prevalence. Then in order for θ̂ = θ̂^0 with high probability, we require ∆ ≳ σ√(K log(K)/n). This requirement cannot be weakened for any estimator; this fact is a consequence of minimax lower bounds on mis-clustering errors in Gaussian mixture models [Lu and Zhou, 2016, Theorem 3.3].
We remark that our result here concerning properties of the global minimiser of our objective is very different from existing results on local minimisers of objectives involving all-pairs-type penalties. For example, in the setting above where K = n, Theorem 2 of Ma and Huang [2017] gives that, provided s = o(n^{1/3}(log n)^{−1/3}) and ∆ ≳ σ s^{3/2} n^{−1/2} log(n), there exists a sequence of local minimisers converging to the oracle least squares estimate with high probability. This is significantly weaker than the condition ∆ ≳ σ√(log n) required for any estimator to recover the oracle least squares solution in this setting, illustrating the substantial difference between results on local and global optima here.

Multivariate model
When the number of variables p > 1, the model can be high-dimensional, and ordinary least squares estimation may fail to provide a unique solution. We will however assume that the least squares problem over θ ∈ Θ_0 in (22) has a unique solution, which occurs if and only if the oracle least squares estimate is unique. In this case, we note that θ̂^0 = AY for a fixed matrix A. A necessary condition for this is that the number of free parameters, Σ_{j=1}^p (s_j − 1), is at most n. Our result below provides a bound on the probability that the oracle least squares estimate is a blockwise optimum of the SCOPE objective (9) with ρ_j = ρ_{γ_j,λ_j}. This is much more meaningful than an equivalent bound for θ̂^0 to be a local optimum, as the number of local optima will be enormous. In general, though, there may be several blockwise optima, and it seems challenging to obtain a result giving conditions under which our blockwise coordinate descent procedure is guaranteed to converge to θ̂^0. Our empirical results (Section 6) show, however, that the fixed points computed in practice tend to give good performance.
Theorem 6. Consider the model (21) and assume θ̂^0 = AY. Suppose that there exists η ∈ (0, 1] such that η/s_j ≤ n^0_{j,min}/n ≤ n^0_{j,max}/n ≤ 1/(ηs_j) for all j = 1, . . . , p. Let γ_{*,j} = min{γ_j, ηs_j} and γ*_j = max{γ_j, ηs_j}. Further suppose that condition (26) holds. Then, letting c_min := (max_l (AA^T)_{ll})^{−1}, the oracle least squares estimate θ̂^0 is a blockwise optimum of (9) with high probability. Now suppose γ_j ≤ ηs_j and the λ_j are such that equality holds in (26) for all j. Then, writing K_max = max_j K_j, n_min = min_j n_{j,min} and ∆_min = min_j ∆(θ^0_j), we have that θ̂^0 is a blockwise optimum of (9) with high probability, where the bound involves an absolute constant c. Consider now an asymptotic regime analogous to that described in the previous section for the univariate case; specifically, assume n_min ≍ n/K_max and c_min ≍ n_min for simplicity. We then see that, in order for θ̂^0 to be a blockwise optimum with high probability, it is sufficient that ∆_min ≳ σ√(K_max log(K_max p)/n).

Extensions
In this section, we describe some extensions of our SCOPE methodology.
Continuous covariates. If some of the covariates are continuous rather than categorical, we can apply any penalty function of choice to these, and perform a regression by optimising the sum of a least squares objective, our SCOPE penalty and these additional penalty functions, using (block) coordinate descent. For example, consider the model (1) with the addition of d continuous covariates. Let Z ∈ ℝ^{n×d} be the centred design matrix for these covariates, with ith row Z_i ∈ ℝ^d. One can fit a model with SCOPE penalising the categorical covariates and the Lasso with tuning parameter α > 0 penalising the continuous covariates, giving an objective over β ∈ ℝ^d and θ ∈ Θ. This sort of integration of continuous covariates is less straightforward when attempting to use tree-based methods to handle categorical covariates, for example.
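As a concrete sketch of one block of such a procedure, the following updates the Lasso-penalised continuous coefficients by cyclic coordinate descent while the categorical fit is held fixed. The function name, the 1/(2n) loss scaling and the fixed iteration count are our choices for illustration, not the paper's implementation:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_block_update(Z, r, beta, alpha, n_iter=50):
    """One block of the descent: cyclic coordinate updates for the Lasso
    part of the objective (1/2n)||r - Z beta||^2 + alpha ||beta||_1,
    holding the categorical fit fixed.
    Z: centred n x d continuous design matrix;
    r: partial residual y minus the current categorical fit;
    alpha: Lasso tuning parameter."""
    n, d = Z.shape
    col_sq = (Z ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            rj = r + Z[:, j] * beta[j]        # residual excluding coordinate j
            zj = Z[:, j] @ rj / n
            beta_new = soft_threshold(zj, alpha) / col_sq[j]
            r = rj - Z[:, j] * beta_new
            beta[j] = beta_new
    return beta, r
```

In the full procedure, this block would alternate with SCOPE updates of the categorical coefficients applied to the residual r.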
Generalised linear models. Sometimes a generalised linear model may be appropriate. Although a quadratic loss function is critical for our exact optimisation algorithm described in Section 3.1, we can iteratively form local quadratic approximations to the loss term in the objective and minimise these. This results in a proximal Newton algorithm and is analogous to the standard approach for solving ℓ₁-penalised generalised linear models [Friedman et al., 2010, Section 3]. An implementation of this scheme in the case of logistic regression for binary responses is available in the accompanying R package CatReg. We remark that when computing logistic regression models with a SCOPE penalty it is advisable to use a larger value of γ than with a continuous response to aid convergence of the proximal Newton step; we recommend a default setting of γ = 100. In Section 6.2 we use the approach described above to perform a logistic regression using SCOPE on US census data.
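A minimal sketch of the local quadratic approximation step for logistic loss, using the usual IRLS working response and weights (the function name and interface are ours, not CatReg's):

```python
import numpy as np

def logistic_quadratic_approx(X, y, theta):
    """One local quadratic approximation of the logistic loss about the
    current coefficients theta: returns IRLS weights w and working
    response z, so that a penalised weighted least squares problem
    min over theta' of (1/2n) sum_i w_i (z_i - x_i' theta')^2 + penalty
    can be handed to a quadratic-loss solver (here, the SCOPE step)."""
    eta = X @ theta
    p = 1.0 / (1.0 + np.exp(-eta))    # fitted probabilities
    w = p * (1.0 - p)                 # curvature weights
    z = eta + (y - p) / w             # working response
    return w, z
```

Iterating this approximation followed by a quadratic-loss SCOPE solve gives the proximal Newton scheme described above.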
Hierarchical categories. Often certain predictors may have levels that are effectively subdivisions of the levels of other predictors. Examples include categories of items in e-commerce, or geographical data with predictors for continent, country and district. For simplicity, we illustrate how such settings may be dealt with in a case with two predictors, but this may easily be generalised to more complex hierarchical structures. Suppose there is a partition G₁ ∪ ⋯ ∪ G_{K₁} of {1, …, K₂} such that, for each k = 1, …, K₁, the levels of the second predictor in G_k represent subdivisions of the kth level of the first predictor. Let K_{2k} := |G_k| and let θ_{2k} refer to the subvector (θ_{2l})_{l∈G_k} for each k = 1, …, K₁, so the components of θ_{2k} are the coefficients corresponding to the levels in G_k. Also let θ_{2k(r)} denote the rth order statistic within θ_{2k}. It is natural to encourage fusion among levels within G_k more strongly than between levels in different elements of the partition. To do this we can modify our objective function so the penalty takes the form given in the display. We furthermore enforce the identifiability constraints that Σ_{l=1}^{K₁} n_{1l} θ_{1l} = 0 and Σ_{l∈G_k} n_{2l} θ_{2l} = 0 for all k = 1, …, K₁.
As well as yielding the desired shrinkage properties, an additional advantage of this approach is that the least squares criterion is separable in θ 21 , . . . , θ 2K 1 so the blockwise update of θ 2 can be performed in parallel. This can lead to a substantial reduction in computation time if K 2 is large.

Numerical experiments
In this section we explore the empirical properties of SCOPE. We first present results on simulated data, and then in Sections 6.2 to 6.5 present analyses and experiments on US census data, insurance data and COVID-19 modelling data. We denote SCOPE with a specific choice of γ by SCOPE-γ, and write SCOPE-CV for SCOPE with a cross-validated choice of γ. SCOPE solutions are computed using our R [R Core Team, 2020] package CatReg [Stokell, 2021], with 5-fold cross-validation to select λ for all examples except those in Section 6.5. We compare SCOPE to linear or logistic regression where appropriate and to a range of existing methods, including CAS-ANOVA (4) and an adaptive version in which the weights w_{j,kl} are multiplied by a factor proportional to |θ̂^init_{jk} − θ̂^init_{jl}|⁻¹, where θ̂^init is an initial CAS-ANOVA estimate. For these methods the tuning parameter λ was also selected by 5-fold cross-validation. We additionally include Delete or merge regressors (DMR) and Bayesian effect fusion (BEF) in some experiments. For the former, models were fitted using DMRnet [Prochenka-Sołtys and Pokarowski, 2018] and selected by 5-fold cross-validation where possible; otherwise an information criterion was used. For BEF, coefficients were modelled with a Gaussian mixture model, with the posterior mean estimated from 1000 samples using effectFusion. We also include comparisons to the tree-based approaches CART and random forests (RF). Lastly, in some experiments, models were also fitted using the Lasso [Tibshirani, 1996]. CART was implemented using rpart [Therneau and Atkinson, 2019] with pruning according to the one standard error rule. Random forests and the Lasso were implemented using the default settings in the randomForest and glmnet packages respectively. For full details of the specific versions of these methods and software used in the numerical experiments, see Section S3.1 of the Supplementary material.

Simulations
We simulated data according to the model (1) with the covariates X_ij generated randomly in the following way. We first drew (W_ij)_{j=1}^p from a multivariate N_p(0, Σ) distribution where the covariance matrix Σ had ones on the diagonal. The off-diagonal elements of Σ were chosen such that U_ij := Φ(W_ij) had corr(U_ij, U_ik) = ρ for j ≠ k. The marginally uniform U_ij were then quantised to give X_ij = ⌈24 U_ij⌉, so the number of levels was K_j = 24.
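A sketch of this covariate-generation scheme in Python. For simplicity we place the equicorrelation ρ directly on the latent Gaussians rather than solving for the Σ that makes the U_ij correlations exactly ρ, as the paper does; the function name and seed handling are ours:

```python
import numpy as np
from math import erf, sqrt

def simulate_categorical_design(n, p, rho, K=24, seed=0):
    """Draw latent W ~ N_p(0, Sigma) with unit diagonal and (here)
    equicorrelation rho, map to marginal uniforms U = Phi(W), then
    quantise into K levels via X_ij = ceil(K * U_ij)."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    W = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))
    U = Phi(W)                                  # marginally Uniform(0, 1)
    return np.ceil(K * U).astype(int)           # levels 1, ..., K
```

Each column of the result is a categorical covariate taking values in {1, …, 24}, correlated across columns through the latent Gaussian copula.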
The errors ε_i were independently distributed as N(0, σ²). The performance of SCOPE and competitor methods was measured by the mean squared prediction error on 10⁵ new (noiseless) observations generated in the same way as the training data, and final results are averages over 500 draws of training and test data. We considered various settings of (n, p, ρ, θ⁰, σ²) below, with low-dimensional and high-dimensional scenarios considered in Sections 6.1.1 and 6.1.2 respectively. The coefficient vectors for each experiment are specified up to an additive constant, which is required to satisfy the identifiability condition (8).
We measured predictive performance by the mean squared prediction error (MSPE), given by MSPE(ĝ) = E_x[{g(x) − ĝ(x)}²], where g is the true regression function, ĝ an estimate, and the expectation is taken over the covariate vector x.

Low-dimensional experiments
Results are presented for three settings with n = 500, p = 10 given below.
1. θ⁰_j = ( …

Each of these experiments was performed with noise variance σ² = 1, 6.25, 25 and 100. Note that the variance of the signal varies across the settings; the signal-to-noise ratio (SNR) for each experiment is displayed in Table 1. Methods included for comparison were SCOPE-8, SCOPE-32, SCOPE-CV, linear regression, vanilla and adaptive CAS-ANOVA, DMR, Bayesian effect fusion, CART and random forests. Also included are the results for the oracle least squares estimator (22).
Results are shown in Table 1 and further details are given in Section S3.2.1 of the Supplementary material. Across all experiments, SCOPE with a cross-validated choice of γ exhibits prediction performance at least as good as the best of the other approaches, and in all but the lowest noise settings performs better than every other method included. In those lowest noise settings, fixing γ at a small value (corresponding to high concavity) gives the leading performance.
In these low noise settings, we see that methods based on first estimating the clusterings of the levels and then estimating the coefficients without introducing further shrinkage, such as DMR or Bayesian effect fusion, perform well. However, they tend to struggle when the noise is larger. In contrast, the tree-based methods perform poorly in low noise settings but exhibit competitive performance in high noise settings.

High-dimensional experiments
We considered 8 settings as detailed below, each with n = 500, p = 100 and simulated 500 times. Models were fitted using SCOPE-8, SCOPE-32, SCOPE-CV, DMR, CART, random forests and the Lasso. Table 2 gives the mean squared prediction errors across each of the settings.
As well as prediction performance, it is interesting to examine how the methods perform in terms of variable selection. With categorical covariates, there are two natural ways of evaluating this. The first is to count false positives and false negatives across the p = 100 categorical variables, defining a variable j to have been selected if θ̂_j ≠ 0. These results are shown in Table 3. This definition of a false positive is quite conservative: often the falsely selected variables have only two estimated levels, each with quite small coefficients, so the false positive rate can increase substantially with only a small increase in the dimension of the estimated linear model.
The second is to assess, within the signal variables (i.e., the j for which θ⁰_j ≠ 0), how closely the estimated clustering resembles the true structure. To quantify this, we use the adjusted Rand index [Hubert and Arabie, 1985]. This is the proportion of all pairs of observations that are either (i) in different true clusters and different estimated clusters, or (ii) in the same true cluster and the same estimated cluster; this is then corrected to ensure that its value is zero when exactly one of the clusterings is 'all-in-one'. In Table 4 we report the average adjusted Rand index over the true signal variables in each setting. Further details can be found in Section S3.2.2 of the Supplementary material. In particular, we include a table with the distribution of cross-validated choices of γ (from the grid {4, 8, 16, 32, 64}) for each experimental setting. Note that the choice γ = 4 is close to the setting γ = 3 recommended in Zhang [2010], though the problem of categorical covariates is very different in nature from the vanilla variable selection problem considered there. Our results suggest that for SCOPE a larger value of γ is preferable across a range of settings, which is also visible in the comparison between γ = 8 and γ = 32 in Table 2.
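The adjusted Rand index can be computed directly from the contingency table of the two clusterings via the Hubert–Arabie formula; a self-contained sketch (the function name is ours):

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items,
    computed from the pairwise contingency table [Hubert and Arabie, 1985]."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    for i, j in zip(a_idx, b_idx):
        table[i, j] += 1
    n = len(labels_a)
    sum_cells = sum(comb(int(nij), 2) for nij in table.ravel())
    sum_rows = sum(comb(int(ni), 2) for ni in table.sum(axis=1))
    sum_cols = sum(comb(int(nj), 2) for nj in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)
```

Identical clusterings (up to relabelling) score 1; the chance correction makes the index zero in expectation under random labellings.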
Across all the settings in this study, SCOPE performs better than the other methods included, regardless of which of the three γ regimes is chosen, although cross-validating γ gives the strongest performance overall. Comparing the results for γ = 8 and γ = 32 suggests that a larger (low-concavity) choice of γ is preferable in higher-dimensional settings. In setting 6, we see from Tables 3 and 4 that SCOPE recovers the true underlying groupings of the coefficients, and hence the oracle least squares estimate, in every repetition; the same holds for some of the experiments in setting 5. In contrast, DMR, which initially applies a group Lasso [Yuan and Lin, 2006] to screen the categorical variables and obtain a low-dimensional model, necessarily misses some signal variables in this first stage and hence struggles here.

Adult dataset analysis
The Adult dataset, available from the UCI Machine Learning Repository [Dua and Graff, 2019], contains a sample of 45 222 observations based on information from the 1994 US census. The binary response variable is 0 if the individual earns at most $50 000 a year, and 1 otherwise. There are 2 continuous and 8 categorical variables; some such as 'native country' have large numbers of levels, bringing the total dimension to 93. An advantage of using SCOPE here over black-box predictive tools such as Random forests is the interpretability of the fitted model.
In Table 5, we show the 25-dimensional fitted model. Within the Education category, we see that six distinct levels have been identified. These agree almost exactly with the stratification one would expect, with all school dropouts before 12th grade being grouped together at the lowest level. Here we assess performance in the challenging setting when the training set is quite small by randomly selecting 1% (452) of the total observations for training, and using the remainder as a test set. Any observations containing levels not in the training set were removed. Models were fitted with SCOPE-100, SCOPE-250, logistic regression, vanilla and adaptive CAS-ANOVA, DMR, Bayesian effect fusion, CART and random forests.
We see that both SCOPE-100 and SCOPE-250 are competitive, with CART and random forests also performing well, though the latter two include interactions in their fits. CAS-ANOVA also performs fairly well, though its misclassification error is larger than that of both versions of SCOPE, and its average fitted model size is larger.

Adult dataset with artificially split levels
To create a more challenging example, we artificially created additional levels in the Adult dataset as follows. For each categorical variable, we recursively selected a level with probability proportional to its prevalence in the data and then split it in two, appending "-0" or "-1" to the level for each observation independently and with equal probability. We repeated this until the total number of levels reached m times the original number of levels for that variable, for m = 2, 3, 4. This process simulates, for example, responses to a survey, where different respondents might answer 'US', 'U.S.', 'USA', 'U.S.A.', 'United States' or 'United States of America' to a question; these would naively all be treated as different answers. We used 2.5% (1130) of the observations for training and the remainder for testing, and applied SCOPE with γ = 100 and logistic regression. Results were averaged over 250 training and test splits. Figure 5 shows that as the number of levels increases, the misclassification error of SCOPE increases only slightly and the fitted model dimension remains almost unchanged, whereas both increase with m for logistic regression.
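The splitting procedure above can be sketched as follows for a single categorical variable; the function name and the details of the stopping rule are our reading of the description:

```python
import random

def split_levels(values, m, seed=0):
    """Recursively split levels of one categorical variable until the
    number of distinct levels is m times the original: pick a level
    with probability proportional to its prevalence, then append
    '-0' or '-1' independently with equal probability to each of the
    observations carrying that level."""
    rng = random.Random(seed)
    values = list(values)
    target = m * len(set(values))
    while len(set(values)) < target:
        level = rng.choice(values)        # prevalence-weighted pick
        values = [v + '-' + str(rng.randint(0, 1)) if v == level else v
                  for v in values]
    return values
```

Applied to a two-level variable with m = 2, this typically produces four levels whose names still record which original level each came from.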

Insurance data example
The Prudential Life Insurance Assessment challenge was a prediction competition run on Kaggle [2015]. By more accurately predicting risk, the burden of extensive tests and check-ups for life insurance policyholders could potentially be reduced. For this experiment, we use the training set that was provided for entrants of the competition.
We removed a small number of variables due to excessive missingness, leaving 5 continuous variables and 108 categorical variables, most with 2 or 3 levels but some with hundreds (the largest having 579 levels). Rather than using the response from the original dataset, which is ordinal, we artificially generated a continuous response to better suit the regression setting with which we are primarily concerned in this work. To construct this signal, 10 of the categorical variables were first selected at random, with probability proportional to the number of levels. For the jth of these, writing K_j for the number of levels, we set s_j := ⌈2 + log(K_j)/2⌉ and assigned each level a coefficient in {1, …, s_j} uniformly at random, thus yielding s_j true levels. The coefficients for the 5 continuous covariates were generated as draws from N₅(0, I). The response was then scaled to have unit variance, after which standard normal noise was added. We used 10% (n = 5938) of the 59 381 observations in total for training, and the remainder to compute an estimated MSPE (28) by averaging over these observations. We repeated this 1000 times, sampling 10% of the observations and generating the coefficients anew in each repetition. The average mean squared prediction errors achieved by the various methods under comparison are given in Figure 6. We see that SCOPE with a cross-validated choice of γ performs best, followed by the Lasso and SCOPE-32.
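The coefficient-generation step above can be sketched as follows. The ceiling in s_j is our reading of the formula (it must yield an integer number of true levels), and the function name is ours:

```python
import numpy as np

def draw_level_coefficients(K_j, seed=0):
    """For a categorical variable with K_j levels, form
    s_j = ceil(2 + log(K_j)/2) true clusters and assign each level a
    coefficient drawn uniformly at random from {1, ..., s_j}."""
    rng = np.random.default_rng(seed)
    s_j = int(np.ceil(2.0 + 0.5 * np.log(K_j)))
    return rng.integers(1, s_j + 1, size=K_j), s_j
```

For the largest variable (579 levels) this gives s_j = 6 true levels, so many of the 579 raw levels share a common coefficient.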

COVID-19 Forecast Hub example
As well as the prediction performance experiments in the rest of this section, we include an exploratory data analysis example based on data relating to the ongoing (at time of writing) global COVID-19 pandemic. The COVID-19 Forecast Hub [2020] '. . . serves as a central repository of forecasts and predictions from over 50 international research groups.' A collection of different research groups publish forecasts every week of case incidence in each US state for some number of weeks into the future.
In order to understand some of the difficulties of this challenging forecasting problem, we fitted an error decomposition model of the form

log{(1 + cases_{w,l}) / (1 + est.cases_{m,t,w,l})} = α₀ + α_{m,t} + β_{w,l} + η_{m,t,w,l},

where w is the week that the forecast is for, l is the state, m indexes the forecasting model, t is the 'target' number of weeks into the future the forecast is for, η_{m,t,w,l} is an error term, and cases_{w,l} and est.cases_{m,t,w,l} are the observed and estimated cases respectively. This decomposition allows an interaction term between time and location, which is important given that the pandemic was known to be more severe at different times in different areas. An interaction between model and forecasting distance was also included in order to capture the effect of some models potentially being more 'short-sighted' than others. The inclusion of the +1 terms on the left-hand side avoids numerators or denominators of zero. We used data from 6 April 2020 to 19 October 2020, giving a total of 100 264 (m, t, w, l)-tuples. We applied a SCOPE penalty with γ = 8 to β_{w,l}, which had 1428 levels. The α_{m,t} coefficients, which amounted to 170 levels, were left unpenalised. The additional tuning parameter λ was selected using the Extended Bayesian Information Criterion [Chen and Chen, 2008] rather than cross-validation, as it was more suited to this sort of exploratory analysis on data with a chronological structure.
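Under our reading of the displayed decomposition, the response being modelled is the following log-ratio of shifted counts (the function name is ours):

```python
import numpy as np

def error_response(cases, est_cases):
    """Left-hand side of the error decomposition: the log of the ratio of
    (1 + observed cases) to (1 + forecast cases). The +1 shifts guard
    against zero numerators or denominators."""
    return np.log((1.0 + np.asarray(cases, dtype=float))
                  / (1.0 + np.asarray(est_cases, dtype=float)))
```

The response is zero for a perfect forecast, positive when the forecast undershoots the observed count, and negative when it overshoots.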
The resulting estimates β̂_{w,l} had 8 levels. We measured the 'similarity' of two US states l_a and l_b over a period of time by computing the proportion of weeks w for which their estimates coincided, that is, for which β̂_{w,l_a} = β̂_{w,l_b}. The similarity matrix presented in Figure 7 was constructed based on the second 'wave' of the epidemic, which occurred in Summer 2020, with clusters identified by applying spectral clustering to the similarity matrix and plotted in order of decreasing within-cluster median pairwise similarity.
The resulting clusters are at once interpretable and interesting. Roughly speaking, the top 3 clusters ('top' when ordered according to median pairwise within-cluster agreement) correspond to states that experienced notable pandemic activity in the second, first, and third 'waves' of the U.S. coronavirus pandemic, respectively. The first cluster features several southern States (e.g., Georgia, Florida, Texas) which experienced a surge of COVID cases in June-July. The second cluster features east coast states (e.g., New Jersey and New York) which experienced an enormous pandemic toll in April-May. And the third features midwestern states (e.g., Kentucky, Indiana, Nebraska) which had upticks most recently in September-October.

Discussion
In this work we have introduced a new penalty-based method for performing regression on categorical data. An attractive feature of a penalty-based approach is that it can be integrated easily with existing methods for regression with continuous data, such as the Lasso. Our penalty function is nonconvex, but in contrast to the use of nonconvex penalties in standard high-dimensional regression problems, the nonconvexity here is necessary in order to obtain sparse solutions, that is fusions of levels. Whilst computing the global optimum of nonconvex problems is typically very challenging, for the case with a single categorical variable with several hundred levels, our dynamic programming algorithm can typically solve the resulting optimisation problem in less than a second on a standard laptop computer. The algorithm is thus fast enough to be embedded within a block coordinate descent procedure for handling multiple categorical variables.
We give sufficient conditions for SCOPE to recover the oracle least squares solution when p = 1 involving a minimal separation between unequal coefficients that is optimal up to constant factors. For the multivariate case where p > 1, we show that oracle least squares is a fixed point of our block coordinate descent algorithm, with high probability.
Our work offers several avenues for further work. On the theoretical front, it would be interesting to obtain guarantees for block coordinate descent to converge to a local optimum with good statistical properties, a phenomenon that we observe empirically. On the methodology side, it would be useful to generalise the penalty to allow for clustering multivariate coefficient vectors; such clustering could be helpful in the context of mixtures of regressions models, for example.

A.1 Candidate minimiser functions
In this section we give explicit forms of the functions p_{k,r} as defined in Section 3.1. We write q_{k,r}(x) = a_r x² + b_r x + c_r for simplicity, suppressing the subscript k. For S ⊆ ℝ and a, b ∈ ℝ, we write aS + b for the set {ax + b : x ∈ S}.
Recall from Section 3.1 the definition of u_{k,r,t}(θ_{k+1}). For a function f : ℝ → ℝ ∪ {∞}, we denote by dom f := {x ∈ ℝ : f(x) < ∞} the effective domain of f. For each r = 1, …, m(k), there are two cases, corresponding to t = 1 and t = 2, with formulas as follows. The second case has dom u_{k,r,2} = [−b_r/(2a_r) + γλ, ∞) if a_r > 0 and −b_r/(2a_r) ∈ I_{k,r}, and dom u_{k,r,2} = ∅ otherwise.
Here, if g_k(θ_{k+1}) = u_{k,r,2}(θ_{k+1}), then b_k(θ_{k+1}) is the constant −b_r/(2a_r). Considering (16), we see that we can also have the case where g_k(θ_{k+1}) = f_k(θ_{k+1}). Thus we can form the set of candidate quadratics p_{k,r} and associated intervals from the u_{k,r,t} for t = 1, 2, together with the q_{k,r} themselves. Note that when g_k(θ_{k+1}) = q_{k,r}(θ_{k+1}), we have b_k(θ_{k+1}) = θ_{k+1}.

A.2 Algorithm details
Algorithm 1 describes in detail how the optimisation routine works. In the algorithm we make use of the following objects:
• for x ∈ ℝ, A(x) is the active set at x;
• E is the set of points at which the active set changes;
• N(x) is the intersection set at x;
• U is a set of tuples (I, r), where I ⊆ ℝ is an interval and r is an integer, which is dynamically updated as the algorithm progresses.
At each step, if min{y : (y, r) ∈ N(x)} < min E, the algorithm sets (y*, r*) = arg min{y : (y, r) ∈ N(x)}, records U = U ∪ {([x̄, y*), r(x))}, moves to x = x̄ = y* with r(x) = r*, resets N(x) = ∅ and then, for any intersection between p_{k−1,r(x)} and any p_{k−1,r} with r ∈ A(x)\{r(x)} at a location y > x, sets N(x) = N(x) ∪ {(y, r)}; intersections between p_{k−1,r(x)} and functions with indices in A(y*)\A(x) at locations y > x are handled similarly (line 13). The output is h*.
See Section 3.1.2 for definitions of the sets above. We also use the convention that if x = −∞ then [x, y) = (−∞, y). All of the p_{k,1}, …, p_{k,m(k)} and J_{k,m} are computed at the start of each iterate k. We then initialise E as the set of all of the end-points of the intervals J_{k−1,1}, …, J_{k−1,n(k)}.
Here x can be thought of as the 'current position' of the algorithm; x̄ is used to store where the minimising function p_{k−1,r(x)} last changed. We initialise x̄ = −∞ and x = −1 + max{y ∈ I_{k−1,1} : f_{k−1}(y⁻) ≤ 0}. This choice of x ensures that the active set A(x) contains only one element (as mentioned in Section 3.1); this will always be the index corresponding to the function q̄_{k−1,1}.
We initialise the output set U = ∅, which by the end of this algorithm will be populated with the functions q̄_{k,1}, …, q̄_{k,m(k)} and their corresponding intervals I_{k,1}, …, I_{k,m(k)} that partition ℝ. Finally, we initialise the set N(x), which will contain the intersections between p_{k−1,r(x)} and other functions in the active set. As the active set begins with only one function, we set N(x) = ∅.
As mentioned in Section 3.1, there are several modifications that can speed up the algorithm. One such modification follows from the fact that, for each r, u_{k,r,2} is a constant function over its effective domain, and each such effective domain is a semi-infinite interval (see Section A.1 of the Appendix for their expressions). Therefore, for a given point x ∈ ℝ, we can remove all such functions from A(x) except for the one taking the minimal value.
We also note that in Algorithm 1 the set N(x) is not recomputed in its entirety at every point x at which A(x) is updated, as described in Section 3.1; line 13 shows how N(x) can often instead be updated by adding or removing individual elements. Often, points 3 (i) and 3 (ii) from the description in Section 3.1 will coincide, and in such instances some calls to ChooseFunction (Algorithm 2) can be skipped.

Supplementary material
This supplementary material is organised as follows. In Section S1 we include further details of our algorithm and the proofs of results in Sections 2 and 3. The proofs of Theorems 5 and 6, along with a number of lemmas they require, can be found in Section S2. Section S3 contains information regarding simulation settings and additional results for the experiments in Section 6 of the main paper.

S1 Additional algorithmic details

S1.1 Remarks on constrained and unconstrained formulations of the univariate objective

It is clear why the identifiability constraint (8) is important when we consider the multivariate problem in Section 3.2. However, for the univariate problem, both constrained and unconstrained formulations of the objective can be clearly defined. As discussed in Section 3.1.1, we can enlarge the feasible set in (30) to be all of ℝ^K: similarly to the observation that Σ_k w_k θ̂ᵘ_k = μ̂ = Σ_k w_k Ȳ_k, the minimiser of (30) over all of ℝ^K will always be in Θ. This can be shown by following the argument at the beginning of the proof of Lemma 10. Therefore the algorithm defined in Section 3.1 can also be applied to the unconstrained formulation of the objective.
It is clear that these problems are essentially identical, as θ̂ᵘ is a minimiser of the unconstrained objective if and only if θ̂ᵘ − μ̂1 is a minimiser of the constrained objective. Observe that while θ̂ᵘ ∈ ℝ^K, the solution to the constrained problem is in fact the pair (μ̂, θ̂ᶜ) ∈ ℝ × Θ, which describes the same K-dimensional space under a different parameterisation. In particular, θ̂ᶜ is non-unique if and only if θ̂ᵘ is non-unique.
Since one can obtain the solution to the constrained objective by solving the unconstrained one and then reparameterising (and vice versa), we are free to assume without loss of generality that wᵀȲ = 0, so that μ̂ = 0, when solving the univariate problem, and we will remark where we do this.
Proof of Proposition 2. Suppose, for a contradiction, that θ̂_k < θ̂_l. Then at least one of (32) and (33) must hold. Let θ̃ be defined as follows: set θ̃_r = θ̂_r for all r ≠ k, l; if (32) holds, set θ̃_k = θ̂_l, and if (33) holds, set θ̃_l = θ̂_k. Observe that the penalty is not increased, while the squared loss of θ̃ is strictly smaller than that of θ̂, contradicting the optimality of θ̂.
Proof of Proposition 3. In this proof we consider the unconstrained formulation of the objective (31) discussed in Section S1.1. Suppose that (Ȳ_k)_{k=1}^K is such that there are two distinct solutions θ̂⁽¹⁾ ≠ θ̂⁽²⁾ to (12). Let us assume that the levels are indexed such that Ȳ_1 ≤ ⋯ ≤ Ȳ_K. Define k* := max{k : θ̂⁽¹⁾_k ≠ θ̂⁽²⁾_k} to be the largest index at which the two solutions take different values. First consider the case where k* < K. We now argue that we must have θ̂_{k*+1} ≥ θ̂⁽ʳ⁾_{k*} + γλ for r = 1, 2. Indeed, suppose not, and suppose without loss of generality that the claim fails for r = 2. The directional derivative of the objective in the direction of the binary vector with ones at the indices given by S_r and zeroes elsewhere, evaluated at θ̂⁽ʳ⁾, must be 0. But comparing these for r = 1, 2, we see they are identical except for the term ρ′(θ̂_{k*+1} − θ̂⁽ʳ⁾_{k*}), which will be strictly larger for r = 2, giving a contradiction. This then implies that both θ̂⁽ʳ⁾_{k*} must minimise f_{k*} over θ ≤ t* − γλ, since the full objective value is f_{k*}(θ̂⁽ʳ⁾_{k*}) + ½γλ² + (terms featuring only index k* + 1 or higher) for r = 1, 2. We also have that when k* = K, both θ̂⁽ʳ⁾_{k*} must minimise f_{k*}. Using the functions g_{k−1} as defined in (15), we have that properties (i) and (iii) of Lemma 4 hold with f_k replaced by g_{k−1}. These can be characterised as g_{k−1}(θ_k) = q̄_{k,r}(θ_k) for θ_k ∈ I_{k,r}, where the I_{k,r} are the intervals associated with f_k and q̄_{k,r}(θ_k) := q_{k,r}(θ_k) − ½w_k(Ȳ_k − θ_k)². Note that for each r, q̄_{k,r} depends on the values of Ȳ_1, …, Ȳ_{k−1} but not on that of Ȳ_k (observe that q_{k,r}(θ_k) includes a term ½w_k(Ȳ_k − θ_k)²; see (13)). Now as θ̂⁽¹⁾_{k*} and θ̂⁽²⁾_{k*} both minimise f_{k*}, by Lemma 4 (iii) both must be local minima of f_{k*}, and there must exist distinct r₁ ≠ r₂ such that θ̂⁽¹⁾_{k*} ∈ I_{k*,r₁} and θ̂⁽²⁾_{k*} ∈ I_{k*,r₂}. Let q̄_{k*,r₁}(x) = a₁x² + b₁x + c₁ and q̄_{k*,r₂}(x) = a₂x² + b₂x + c₂.
Considering all pairs r₁, r₂, we see that in order for there to exist two solutions θ̂⁽¹⁾ ≠ θ̂⁽²⁾, Ȳ_{k*} must take values in a set of size at most c(K), for some function c : ℕ → ℕ. Now let S := {(Ȳ_k)_{k=1}^K : the minimiser of the objective is not unique} ⊆ ℝ^K.
What we have shown is that associated with each element (Ȳ_k)_{k=1}^K ∈ S there is at least one k* such that |{(Ȳ′_k)_{k=1}^K ∈ S : Ȳ′_k = Ȳ_k for all k ≠ k*}| is bounded above by c(K). Now for each j = 1, …, K, let S_j be the set of (Ȳ_k)_{k=1}^K ∈ S for which there exists a k* with the property above and k* = j. Note that ∪_j S_j = S. Each S_j ⊂ ℝ^K has Lebesgue measure zero, as a finite union of graphs of measurable functions f : ℝ^{K−1} → ℝ. Thus S has Lebesgue measure zero.
Proof of Lemma 4. Assume, without loss of generality, that μ̂ = 0. We proceed inductively, assuming that properties (i) and (iii) hold for f_k and that (ii) holds for b_{k+1}. Additionally, we include in our inductive hypothesis that f_k(x⁻) ≥ f_k(x⁺) for all x, where we define f_k(x⁻) and f_k(x⁺) to be the left-derivative and right-derivative of f_k at x, respectively. We note that these trivially hold for the base case f_1, and the case b_2 can be checked by direct calculation.
We first prove (i): that f_{k+1} is continuous, coercive and piecewise quadratic with finitely many pieces. We then show that f_{k+1}(x⁻) ≥ f_{k+1}(x⁺) for all x, which allows us to show that (iii) holds for f_{k+1}. Finally, we use these results to show that (ii) holds for b_{k+2}.
We now show that f_{k+1} is coercive and continuous. Clearly g_k(x) ≥ min_{y≤x} f_k(y), so it follows that g_k(x) → ∞ as x → −∞, since f_k is coercive. Furthermore, g_k is bounded from below as f_k is coercive and continuous. Thus, since f_{k+1}(x) = g_k(x) + ½w_{k+1}(Ȳ_{k+1} − x)², it follows that f_{k+1} is coercive. Next, as g_k(x) = min_{y≤x} {f_k(y) + ρ(x − y)}, and f_k and ρ are continuous, it follows that g_k is continuous and therefore that f_{k+1} is continuous.
To see why f_{k+1} is piecewise quadratic with finitely many pieces, observe that it can be written in terms of f_k and b_{k+1}. By our inductive hypothesis, f_k is piecewise quadratic and b_{k+1} is piecewise linear, both with finitely many pieces. Since the composition of a piecewise linear function inside a piecewise quadratic function is piecewise quadratic, the remainder of (i) follows.
We now turn our attention to (iii), and define for x ∈ ℝ the minimiser y*(x) ∈ arg min_{y≤x} {f_k(y) + ρ(x − y)}. We will first show that f_{k+1}(x⁺) ≤ f_{k+1}(x⁻) for all x ∈ ℝ. Suppose that we are increasing x and we have reached a point where g_k(x) is not differentiable (that is, the left-derivative and the right-derivative do not match). By assumption (ii) for b_{k+1}, we can assume that there is some window δ > 0 such that y*(t) is linear for t ∈ (x − δ, x), say y*(t) = α + βt.

Note that f_k has both left-derivatives and right-derivatives at every point in ℝ. Suppose first that β ≥ 0. Then, by the basic definition of the right-derivative, the required inequality follows, where the last step uses our inductive hypothesis that f_k(y⁺) ≤ f_k(y⁻) for all y ∈ ℝ. An analogous argument shows that the same conclusion holds when β < 0. Now we use this to prove the claim. Because there are no points of f_{k+1} at which the left-derivative is less than the right-derivative, we claim that f_{k+1} is differentiable at y*(x) for all x, unless y*(x) = x. Indeed, suppose not; then f_{k+1}(y*(x)⁻) > f_{k+1}(y*(x)⁺) and, defining h(y) := f_{k+1}(y) + ρ(x − y), we necessarily have 0 ∈ ∂h(y*(x)). But since h(y*(x)⁺) < h(y*(x)⁻), we contradict the optimality of y*(x), as this point is in fact a local maximum.
We finally consider claim (ii). By (iii), we have that for every point x, y*(x) is either x or at the minimum of one of the quadratic pieces of f_{k+1}(·) + ρ(x − ·). In either case, y*(x) is linear in x and thus f_{k+1}(y*(x)) + ρ(x − y*(x)) is quadratic in x. We can define g_{k+1}(x) pointwise as the minimum of this finite set of quadratic functions of x, whose expressions are given in Appendix A.1. Importantly, the coefficients in the linear expression for y*(x) depend only on which of these functions is the minimum at x. As the number of intersections between elements of this set of quadratic functions is bounded above by twice the square of the size of the set, we conclude that b_{k+2}(x) is piecewise linear with a finite number of pieces, concluding the proof.

S1.3 Computation time experiments
A small experiment was performed to demonstrate the runtimes one can expect in practice for the univariate problem. Note that this clustering is applied iteratively in the block coordinate descent procedure we propose to use in multivariate settings. We considered 3 settings: one with no signal, one with 2 true clusters and one with 5 true clusters. Independent and identically distributed Gaussian noise was added to each of the subaverages. As in Section 6.3, the number of categories was increased by random splitting of the levels. Each of these tests was repeated 25 times, on a computer with a 3.2GHz processor. The results are shown in Figure 8.
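To fix ideas, the level-splitting mechanism can be sketched as follows. This is an illustrative sketch only, not the code used in the experiments; the function name and the encoding of the split are assumptions.

```python
import numpy as np

def split_levels(labels, rng):
    """Double the number of categories by randomly splitting each level
    in two. The coefficients attached to the levels are unchanged, so
    the underlying clustering of levels is preserved while the number
    of categories K doubles. Illustrative sketch only."""
    labels = np.asarray(labels)
    # level j is split into new levels 2j and 2j + 1, chosen at random
    return labels * 2 + rng.integers(0, 2, size=labels.shape)
```

Applying this map repeatedly grows K geometrically while keeping the true cluster structure fixed, which is what allows runtime to be studied as a function of K alone.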

S1.4 Discretised algorithm
For very large-scale problems, speed can be improved if we only allow coefficients to take values in some fixed finite grid, rather than any real value. Below we describe how such an algorithm would approximately solve the univariate objective (12). We use the unconstrained objective as discussed in Section S1.1. We first fix L grid points ϑ_1 < ⋯ < ϑ_L, and then proceed as described in Algorithm 3. This algorithm has the same basic structure as the approach we use in Section 3.1 for computing the exact global optimum. The difference is that now, instead of defining f_k as in (14), we define it only on the grid points. The objects F and B play analogous roles to f_k and b_k in Section 3.1. Since we restrict θ_k ∈ {ϑ_1, …, ϑ_L}, we only need to store the values that f_k takes at these L values; this is the purpose of the vector F in Algorithm 3. Similarly, the rows B(k, ·) serve the same purpose as the functions b_k where, again, we only need to store the L values corresponding to the different options for θ_k.

9:     Set F_new(l) = F_old(B(k, l)) + ρ(ϑ_l − ϑ_{B(k,l)}) + ½ w_k (Ȳ_k − ϑ_l)²
10:   end for
11: end for
12: Set B*(K) = argmin_l F_new(l), and θ̂_K = ϑ_{B*(K)}
13: for k = K − 1, …, 1 do
14:   Set B*(k) = B(k + 1, B*(k + 1)), and θ̂_k = ϑ_{B*(k)}
15: end for

This algorithm returns the optimal solution θ̂ to the objective in which each coefficient is restricted to take values only in {ϑ_1, …, ϑ_L}. We must ensure that the grid of values has fine enough resolution that interesting answers can be obtained, which requires L to be sufficiently large. In particular, the number of clusters obtained by this approximate algorithm is bounded above by L, so L must not be chosen too small.
One can see that the computational complexity of this algorithm is linear in K, with a total of O(KL²) operations required. This is of course in addition to the O(n) operations needed to compute w_1, …, w_K and Ȳ_1, …, Ȳ_K beforehand. In particular, choosing L = O(√K) guarantees that the complexity of this algorithm is at worst quadratic in K.
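A minimal Python sketch of this dynamic program may help fix ideas. All names are illustrative rather than taken from CatReg; `rho` stands for a vectorised penalty on coefficient differences, and we assume, as in the exact algorithm of Section 3.1, that the subaverages are sorted and predecessors are restricted to grid values no larger than the current one.

```python
import numpy as np

def scope_discretised(ybar, w, rho, grid):
    """Grid-restricted dynamic program in the spirit of Algorithm 3.

    ybar : sorted subaverages Ybar_1, ..., Ybar_K
    w    : weights w_1, ..., w_K
    rho  : vectorised penalty on (nonnegative) coefficient differences
    grid : candidate coefficient values theta_1 < ... < theta_L

    F_old plays the role of the vector F (costs at the L grid values)
    and B the role of the backpointer rows B(k, .).
    """
    ybar, w, grid = map(np.asarray, (ybar, w, grid))
    K, L = len(ybar), len(grid)
    F_old = 0.5 * w[0] * (ybar[0] - grid) ** 2        # costs for theta_1
    B = np.zeros((K, L), dtype=int)                   # backpointers
    for k in range(1, K):
        F_new = np.empty(L)
        for l in range(L):
            # best predecessor for theta_k = grid[l] among grid[:l+1]
            cand = F_old[: l + 1] + rho(grid[l] - grid[: l + 1])
            B[k, l] = int(np.argmin(cand))
            F_new[l] = cand[B[k, l]] + 0.5 * w[k] * (ybar[k] - grid[l]) ** 2
        F_old = F_new
    # backtrack through the pointers to recover the fitted coefficients
    theta = np.empty(K)
    b = int(np.argmin(F_old))
    theta[K - 1] = grid[b]
    for k in range(K - 2, -1, -1):
        b = B[k + 1, b]
        theta[k] = grid[b]
    return theta
```

With `rho` identically zero this reduces to componentwise projection of the sorted subaverages onto the grid, while a large constant penalty on any nonzero difference fuses all coefficients into a single cluster; the inner loop over predecessors gives the O(KL²) complexity noted above.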

S2.1 Proof of Theorem 5
The proof of Theorem 5 requires a number of auxiliary lemmas, which can be found in Section S2.1.1.

S2.1.1 Auxiliary lemmas
Here we prove a number of results required to obtain conditions for recovering the oracle least squares estimate in the univariate case. Lemma 10 gives conditions for recovery of the true solution in the case where there is zero signal. Lemmas 8 and 9 ensure that the true levels are far enough apart that they can be separated. Once we have this separation, we apply Lemma 10 to each of the levels to obtain the solution.

Equation (2.3) of
If γλ < 2τ, then the minimum of F over [γλ, 2τ] is attained exactly at 2τ. Thus here it also suffices to check F(0) < F(2τ), which holds if and only if τ < √(γ/κ) λ/2. The final bound τ < (1 ∧ √(κγ)) λ/(2κ) follows from combining the results for these cases.
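As a remark added here for clarity, the bound arising from the case F(0) < F(2τ) can be checked against the combined bound via the elementary identity
\[
\frac{\lambda}{2}\sqrt{\frac{\gamma}{\kappa}} \;=\; \frac{\sqrt{\kappa\gamma}}{2\kappa}\,\lambda ,
\]
so this case corresponds precisely to the \(\sqrt{\kappa\gamma}\) branch of \((1 \wedge \sqrt{\kappa\gamma})\lambda/(2\kappa)\).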
The following is a deterministic result to establish separation between groups of coefficients.
We now proceed to show that for j = 1, …, s − 1, we have θ̂_{k_j} ≤ Ȳ_{k_j} and θ̂_{k_j+1} ≥ Ȳ_{k_j+1}. We prove the first of these sets of inequalities, since the second follows similarly by considering the problem with −θ, −Ȳ and reversing the indices. Suppose, for contradiction, that there exists some j in {1, …, s − 1} with θ̂_{k_j} > Ȳ_{k_j}. Let this j be minimal, so that for all l < j we have θ̂_{k_l} ≤ Ȳ_{k_l}.
With this knowledge, we can further simplify (48) to obtain (52). We can observe that (52) is the sum of two copies of (46) in case (A1). Hence, by following the same argument, we may take k with Ȳ_{k_j} < θ̂_k < θ̂_{k_j}. Then we construct θ̃ by setting θ̃_l = θ̂_k for l = k, k + 1, …, k_j, and θ̃_l = θ̂_l otherwise.
We observe that the penalty contribution from θ̃ is no more than that of θ̂, and that the quadratic loss for θ̃ is strictly less than that of θ̂. This gives us that Q(θ̃) < Q(θ̂), contradicting the optimality of θ̂. Similarly, if θ̂_{k_{j−1}+1} < Ȳ_{k_{j−1}+1}, then the corresponding statement holds: for any k with k_{j−1} + 1 ≤ k ≤ k_j, if θ̂_k < Ȳ_{k_{j−1}+1} then θ̂_k = θ̂_{k_{j−1}+1}.
We now establish a simple preliminary result. Suppose that for some j in {1, …, s} there exists k in {k_{j−1} + 1, …, k_j}. To prove the claim, we consider the case θ̂_k > Ȳ_{k_j} (the other is identical). By the first observation, if θ̂_l > Ȳ_{k_j} for l in {k_{j−1} + 1, …, k_j} then θ̂_l = θ̂_k. Now, for contradiction, suppose θ̂_k > Ȳ_{k_j} + (√(2s/η) √γ λ ∨ γλ) and let this k be minimal. Then, in order to contradict the optimality of θ̂, we construct a new feasible point θ̃ by setting θ̃_l = θ̂_k for l = k′, …, k_j, θ̃_l = θ̂_l for l = k_j + 1, …, l′ − 1, and θ̃_l = θ̂_l otherwise.
Proof. Let P_w = I − 1w^⊤/w̄ and let D_w ∈ ℝ^{K×K} be the diagonal matrix with entries D_{kk} = √w_k.
First note an identity for P_w and D_w that holds for all θ ∈ ℝ^K. Consider minimising F over ℝ^K × [−τ, τ]^K × S, where S ⊆ ℝ^K is the unit simplex scaled by w̄. We aim to show this minimum is 0. As with the first claim in the proof of Lemma 8, it is straightforward to see that for any feasible (θ, ξ, w), there exists θ′ with ‖θ′‖_∞ ≤ ‖ξ‖_∞ and F(θ′, ξ, w) ≤ F(θ, ξ, w); hence the minimisation may be restricted accordingly. As on the right-hand side we are minimising a continuous function over a compact set, a minimiser must exist. Let (θ̂, ξ̂, ŵ) be a minimiser (to be specified later). Observe that F is linear as a function of ξ; hence it is minimised at an extreme point of [−τ, τ]^K = conv({−τ, τ}^K), where conv(·) denotes the convex hull operation. We may thus take (θ̂, ξ̂) ∈ ℝ^K × {−τ, τ}^K to be a minimiser of the right-hand side. Note that if ξ̂_j = ξ̂_k then we may take θ̂_j = θ̂_k. Indeed, we may construct θ̃ ∈ ℝ^K by setting the corresponding entries equal; since the penalty contribution from θ̃ is not greater than that of θ̂, it follows that Q(θ̃) ≤ Q(θ̂). Thus we can assume that the entries of θ̂ take one of only two distinct values. Next we write ᾱ = Σ_{k: ξ̂_k = −τ} ŵ_k and observe that ŵ^⊤ξ̂ = (w̄ − 2ᾱ)τ. Let us set s = min_k θ̂_k and x = max_k θ̂_k − min_k θ̂_k. In the resulting display, in the second line we have solved for s, and in the third line we have solved for ᾱ to obtain ᾱ = w̄/2 and hence ᾱ(w̄ − ᾱ)/w̄ = w̄/4; these follow from optimality of θ̂ and ŵ respectively. The result follows from applying Lemma 7, setting κ = w̄/4.

S2.2 Proof of Theorem 6
We begin by defining P_0 to be the orthogonal projection onto the column space of the oracle design, so that the residuals from the oracle least-squares fit are (I − P_0)Y = (I − P_0)ε. The partial residuals R^{(j)} for the jth variable, as defined in (18), follow directly. For j = 1, …, p, we define R̄^{(j)}_k to be the average of the partial residuals over the n_{jk} observations with level k, for k = 1, …, K_j, reordering the labels such that R̄^{(j)}_1 ≤ ⋯ ≤ R̄^{(j)}_{K_j}. We then aim to apply the arguments of Theorem 5 to θ̂_j defined as the corresponding minimiser. In order to do this, we define events Λ^{(1)}_j and Λ^{(2)}_{jk} (for some τ_j to be determined later). On the intersection of these events, by an identical approach to that involved in computing (35), we obtain the corresponding bound, where we recall that w_{jk} = n_{jk}/n. We now turn our attention to the event Λ^{(1)}_j. Note that if s_j = 1, then this is immediately satisfied since θ̂⁰_j = θ⁰_j = 0. If s_j > 1, we use that the oracle least squares estimate θ̂⁰ = AY is a linear transformation A of the responses (Y_i)_{i=1}^n. For each i = 1, …, n, Y_i has an independent (non-central) sub-Gaussian distribution with parameter σ. Therefore, for each k = 1, …, K_j, θ̂⁰_{jk} − θ⁰_{jk} also has a sub-Gaussian distribution, with parameter at most σ c_min^{−1/2} (recalling that c_min = (max_l (AA^⊤)_{ll})^{−1}). This enables us to bound the probability that Λ^{(1)}_j fails. We can now set τ_j = √(η γ*_j s_j) λ_j / 2. From (26) and the triangle inequality, on the event Λ^{(1)}_j we obtain the separation required by Theorem 5. Thus, on the intersection of the events Λ^{(2)}_{jk}, we can proceed as in the proof of Theorem 5 from (38) to conclude that θ̂_j = θ̂⁰_j. It immediately follows that on the intersection of events ∩_{j=1}^p (Λ^{(1)}_j ∩ ∩_k Λ^{(2)}_{jk}), we have θ̂ = θ̂⁰. By a union bound, this occurs with probability at least

1 − Σ_{j=1}^p [ 2 exp(−n_{j,min} η γ*_j s_j λ_j² / (8σ²) + log K_j) + 2 exp(−c_min η γ*_j s_j λ_j² / (8σ²) + log s_j) ]
  ≥ 1 − 4 Σ_{j=1}^p exp(−(n_{j,min} ∧ c_min) η γ*_j s_j λ_j² / (8σ²) + log K_j),

where in the final line we use s_j ≤ K_j.

Tree-based methods
We used the implementation of the random forest procedure  in the R package randomForest  with default settings. CART  was implemented in the R package rpart [Therneau and Atkinson, 2019], with pruning according to the 1-SE rule (as described in the package documentation).

CAS-ANOVA
The CAS-ANOVA estimator θ̂_cas optimises over (µ, θ) the sum of a squared loss term (3) and an all-pairs penalty term (4). Two regimes of weight vectors w are considered. The first is not data-dependent and sets w_{j,k_1k_2} = (K_j + 1)^{−1} √(n_{jk_1} + n_{jk_2}). The second, 'adaptive CAS-ANOVA', uses the ordinary least squares estimate of θ to scale the weights: w_{j,k_1k_2} = (K_j + 1)^{−1} √(n_{jk_1} + n_{jk_2}) |θ̂^OLS_{jk_1} − θ̂^OLS_{jk_2}|^{−1}. Here we introduce a new variant of adaptive CAS-ANOVA, following ideas in Bühlmann and Van De Geer [2011] for a 2-stage adaptive Lasso procedure. Instead of using the ordinary least squares estimate θ̂^OLS in the above expression, an initial (standard) CAS-ANOVA estimate is used to scale the weights, with λ for the initial estimate selected by 5-fold cross-validation. In simulations, this outperformed the adaptive CAS-ANOVA estimate initialised with ordinary least squares, so in the interests of time and computational resources the latter was omitted from the simulation study. Henceforth, adaptive CAS-ANOVA will refer to this 2-stage procedure. The authors describe the optimisation of θ̂_cas as a quadratic programming problem. We used our own implementation of this quadratic programming approach, solved using the R package rosqp [Anderson, 2018], which provides ADMM-based optimisation [Boyd et al., 2011] tools not available at the time of the original publication; we found it considerably faster than the code available from the authors' website. We also found, as discussed in Section 5.1 of Maj-Kańska et al. [2015], that we could not achieve the best results using the publicly available code. Lastly, using our own implementation allowed us to explore the modification of CAS-ANOVA with adaptive weights via a 2-stage procedure [Bühlmann and Van De Geer, 2011], so as to compare SCOPE with a wider class of all-pairs penalty procedures.
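The two weighting schemes can be sketched as follows; the function and argument names are illustrative and not part of any package.

```python
import numpy as np

def casanova_weights(n_jk, K_j, theta_init=None):
    """All-pairs fusion weights for one categorical variable (sketch).

    n_jk: level counts n_{jk}; K_j: number of levels. With
    theta_init=None this gives the non-data-dependent weights
    w_{k1,k2} = (K_j + 1)^{-1} sqrt(n_{k1} + n_{k2}); otherwise the
    adaptive variant additionally divides by |theta_{k1} - theta_{k2}|
    of an initial estimate (OLS, or a first-stage CAS-ANOVA fit in the
    2-stage variant described above)."""
    W = {}
    for k1 in range(K_j):
        for k2 in range(k1 + 1, K_j):
            w = np.sqrt(n_jk[k1] + n_jk[k2]) / (K_j + 1)
            if theta_init is not None:
                # levels whose initial estimates are close get large
                # weights, and so are pushed harder towards fusion
                w /= abs(theta_init[k1] - theta_init[k2])
            W[(k1, k2)] = w
    return W
```

The adaptive division by the initial coefficient gap is what makes nearly-equal levels cheap to fuse and well-separated levels expensive, mirroring the adaptive Lasso construction.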
For large categorical variables, solutions are slow to compute and consume large amounts of memory. In the case of a binary response, CAS-ANOVA models were fitted by iterating a local quadratic approximation to the loss function.

DMR
The DMR algorithm is implemented in the R package DMRnet [Prochenka-Sołtys and Pokarowski, 2018]. It is based on pruning variables using the Group Lasso [Yuan and Lin, 2006] to obtain a low-dimensional model, then performing backwards selection based on ranking t-statistics for hypotheses corresponding to each fusion between levels of categorical variables. The degrees of freedom in the model are selected by 5-fold cross-validation.
The cross-validation routine returned an error when not all levels of every categorical variable were present in each of the folds. In Section 6.2, cross-validation was therefore not possible, so model selection was performed using the Generalized Information Criterion (GIC) [Zheng and Loh, 1995]. In all other examples, models were selected via 5-fold cross-validation.

Bayesian effect fusion
In Section 6.1.1 we include Bayesian effect fusion, implemented in the R package effectFusion. Coefficients within each categorical variable are modelled with a sparse Gaussian mixture model. The posterior mean was estimated from 1000 samples.

Lasso
In Section 6.1.2 we also include Lasso [Tibshirani, 1996] fits, to serve as a reference point. Of course, the Lasso is unsuitable for models where levels in categorical variables should be clustered together, but the maturity of the well-known R package glmnet nevertheless sees it widely used in practice.
In order to make the fit symmetric across the categories within each variable, models were fitted with an unpenalised intercept and with dummy variables for all of the categories within each variable. This differs from the corner-point dummy variable encoding of factor variables that is commonly used when fitting linear models. Models were fitted and cross-validated with cv.glmnet using the default settings.
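The symmetric encoding can be sketched as follows; this is an illustrative helper (names are ours), not the encoding routine of any package.

```python
import numpy as np

def full_dummy_encoding(levels):
    """Encode a categorical vector with one indicator column per level.

    Unlike corner-point (treatment) coding, no baseline level is
    dropped; the intercept is then kept as a separate, unpenalised
    column, so the penalised fit treats all levels symmetrically."""
    cats = sorted(set(levels))
    X = np.zeros((len(levels), len(cats)))
    for i, lev in enumerate(levels):
        X[i, cats.index(lev)] = 1.0
    return X, cats
```

The resulting design matrix is rank-deficient together with an intercept column, but this is harmless here because the penalty resolves the ambiguity among the level coefficients.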

SCOPE
For SCOPE, we provide the R package CatReg [Stokell, 2021]. The univariate update step (see Section 3.1) is implemented in C++ using Rcpp [Eddelbuettel and François, 2011], with models fitted via a wrapper in R. In the binary response case, the outer loop iterating the local quadratic approximations in the proximal Newton algorithm is run within R. In the future, performance could be improved by iterating the univariate update step (and the local quadratic approximations, as in Sections 6.2 and 6.3) within a lower-level language. In the higher-dimensional experiments, SCOPE was slowed by cycling through all the variables; an active-set approach could make it faster still.

S3.2 Further details of numerical experiments
For the experiments in Section 6.1, we define the signal-to-noise ratio (SNR) as σ S /σ, where σ S is the standard deviation of the signal Y − ε, and σ is the standard deviation of the noise ε.
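This definition amounts to a one-line helper (illustrative only, not from CatReg):

```python
import numpy as np

def snr(y, eps):
    """Signal-to-noise ratio as defined above: the standard deviation
    of the signal Y - eps divided by that of the noise eps."""
    signal = y - eps
    return np.std(signal) / np.std(eps)
```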

S3.2.1 Low-dimensional simulations
In Table 7 we include details of computation time and dimension of the fitted models. Figure 9 visualises the results also summarised in Table 1 in the main paper.

S3.2.2 High-dimensional simulations
Here we include additional results relating to the high-dimensional experiments. Figure 10 visualises the results in Table 2.