Multiresolution categorical regression for interpretable cell type annotation

In many categorical response regression applications, the response categories admit a multiresolution structure. That is, subsets of the response categories may naturally be combined into coarser response categories. In such applications, practitioners are often interested in estimating the resolution at which a predictor affects the response category probabilities. In this article, we propose a method for fitting the multinomial logistic regression model in high dimensions that addresses this problem in a unified and data-driven way. In particular, our method allows practitioners to identify which predictors distinguish between coarse categories but not fine categories, which predictors distinguish between fine categories, and which predictors are irrelevant. For model fitting, we propose a scalable algorithm that can be applied when the coarse categories are defined by either overlapping or nonoverlapping sets of fine categories. Statistical properties of our method reveal that it can take advantage of this multiresolution structure in a way existing estimators cannot. We use our method to model cell type probabilities as a function of a cell's gene expression profile (i.e., cell type annotation). Our fitted model provides novel biological insights which may be useful for future automated and manual cell type annotation methodology.


Introduction
Multinomial logistic regression is one of the most widely used approaches for modeling the probability mass function of a K-category random variable as a function of a p-dimensional predictor (Agresti, 2003, Chapter 8).This model assumes that the random response category counts for the ith subject, Y i , can be modeled where n i is the number of independent trials for the ith subject, x i = (1, x i,2 , . . ., x i,p ) ⊤ ∈ R p is the p-dimensional predictor for the ith subject, and π * k (x i ) is the probability of the kth category in a single trial for the ith subject for k ∈ {1, . . ., K} =: [K].The multinomial logistic regression model assumes the link where β * = (β * :,1 , . . ., β * :,K ) ∈ R p×K is an unknown matrix of regression coefficients.For the remainder, let β * j,: ∈ R K be the jth row of β * , let β * :,k ∈ R p be the kth column of β * , and let [a] denote the set {1, . . ., a} for positive integer a.
In this article, we focus on fitting (2) in settings where the K response categories can be coarsened at multiple resolutions.By coarsened, we mean that the union of two or more of the K response categories constitutes a coherent (as determined by the application) category.As a toy example, suppose the set of response categories were {Sweden, Norway, United States, Canada, Brazil, Argentina}.Possible coarsened categories include Scandinavia = {Sweden, Norway}, North America = {United States, Canada}, Northern Hemisphere = {Sweden, Norway, United States, Canada}, and Americas = {United States, Canada, Brazil, Argentina}.In applications where the response exhibits this structure, a practitioner is often interested in understanding the resolution at which a predictor affects the response.In the preceding example, a practitioner may consider fitting separate versions of (2) with responses {Northern Hemisphere, Southern Hemisphere}, or {Scandinavia, North America, South America}, or {Scandinavia, Americas}, and so on.Each fitted model may provide insights about the relationship between predictor and response which are obscured by the other models.When the dimension of the predictor p, number of response categories K, and number of coarse categories are large, this becomes a combinatorially difficult modeling task.
Our work is motivated by the task of assigning an individual cell a label corresponding to its biological function, i.e., cell type annotation-a key step in many single-cell RNA-seq data analyses (Pasquini et al., 2021).Specifically, we model cell type probabilities as a function of a cell's p-dimensional gene expression profile measured by single-cell RNA sequencing.
Table 1: Coarse categories in the single-cell dataset from Hao et al. (2021).Indented coarse cell type descriptions indicate that the cell type is a subset of its preceding coarse cell type.Four fine cell types (Eryth, HSPC, ILC, Platelet) do not belong to any coarse category and do not appear in the table.Superscripts are omitted from fine cell types for ease of display.For details, see Figure 2 of Maecker et al. (2012) and references therein.
One crucial aspect of building such a model is determining which genes' expression distinguishes different cell types.A well-estimated set of discriminative genes will not only yield better classification accuracy, but can provide useful scientific insights and improve model interpretability (Dumitrascu et al., 2021).Thus, our goal is to develop a method which performs model estimation and gene selection jointly for the purpose of interpretable cell type annotation.
Cell type, the categorical response, naturally admits the aforementioned "multiresolution" structure.For example, a subset of the cell types in our motivating data analysis include {CD8 + Naive, CD8 + TCM, CD8 + TEM}, where TCM and TEM denote central memory and effector memory T cells, respectively.In this setting, possible coarsened categories of these cell types include CD8 + T cells = {CD8 + Naive, CD8 + TCM, CD8 + TEM} and CD8 + memory T cells = {CD8 + TCM, CD8 + TEM}.A particular gene may be useful for distinguishing CD8 + T cells from the others, but may not be useful for discriminating between the subtypes of CD8 + T cells.Conversely, a gene may distinguish between CD8 + Naive and CD8 + memory T cells, but cannot distinguish between the two subtypes of CD8 + memory T cells.To the best of our knowledge, our method is the first that performs gene selection wherein the effect of a particular gene can be interpreted in such a way.In so doing, our method addresses a recurring theme among the "grand challenges" in single-cell data science (Lähnemann et al., 2020) by accounting for the multiresolution ("varying levels of resolution") of cell types.In Table 1, we provide the set of coarse categories we use in our analysis in Section 6.
The new estimator proposed in this article, loosely speaking, allows practitioners to consider models at all possible response resolutions simultaneously in a unified and datadriven way.As we will demonstrate, exploiting the multiresolution structure of the response categories can lead to more interpretable fitted models and better estimates of the mass function of interest, (1), than existing approaches.
To see how one could consider models at various response resolutions simultaneously, suppose all but the jth predictor is held fixed.Then, the jth predictor can affect the response probabilities at one of multiple possible resolutions: the coarsest resolution, the finest resolution, at a coarse resolution for some categories and fine resolution for others, or not at all.Formally, we say the jth predictor does not distinguish between fine categories in A l if for any x ∈ R p and δ ∈ R, for all pairs (k, k ′ ) ∈ A l × A l , where e j is jth standard basis vector in R p .That is, (3) says that the probability of the response taking category k relative to category k ′ is not affected by the jth predictor.Under model (1), (3) holds if β * j,k − β * j,k ′ = 0. Thus, restated in terms of β * , the jth predictor does not distinguish between fine categories in A l if β * j,A l ∝ 1 a l , where a l = |A l |, 1 a l = (1, . . ., 1) ⊤ ∈ R a l , and β * j,A l ∈ R a l is the subvector of β * j,: consisting of the elements indexed by A l .Continuing with the example with six fine response categories, we provide a hypothetical submatrix of β * in Figure 1 (b).In this figure, each cell represents a regression coefficient, and each row corresponds to a distinct predictor.Within a row, each distinct color represents a distinct numeric value.For example, the leftmost three coefficients in the top row are distinct, and the rightmost three coefficients are all identical (though distinct from the leftmost three).Based on this structure, the corresponding predictor distinguishes between all fine response categories in A 1 and A 2 , but does not distinguish between the fine categories in A 4 .By the same logic, the predictor corresponding to the middle row distinguishes categories 3 and 6 from all others, but does not distinguish between fine categories in coarse categories A 1 or A 3 .Finally, because all the coefficients in the bottom row are identical, the corresponding predictor does not distinguish between any of the response categories.Hence, if β * j,: ∝ 1 K , then the jth predictor is irrelevant (i.e., do not distinguish between any categories).
In the next section, we propose a method which allows one to explore these types of blockwise structures along a continuous solution path using convex optimization.The path is controlled by a pair of tuning parameters which can be chosen using cross-validation, i.e., the user need not specify how the predictor affects the response.
It is important to emphasize that our method does not require that fine categories belong to a single coarse category, nor does it require that coarse categories adhere to a treestructured hierarchy (as in Figure 1).All that our method requires is a collection of sets A l ⊂ [K] for l ∈ [L] such that all the categories whose indices belong to A l constitute a meaningful (as determined by the application) coarse category.

Penalized maximum likelihood estimator
Let y 1 , . . ., y n be n independent realizations from (1) based on predictors x 1 , . . ., x n , respectively.Specifically, y i = (y i,1 , . . ., y i,K ) ⊤ where y i,k is the count for the ith subject's kth category.The (scaled by 1/n) negative log-likelihood for (1), ignoring constants, is Without loss of generality, we assume that n i = 1 for i ∈ [n] for the remainder of the article.
To obtain estimates of β * which account for the multiresolution structure of the response, we use a penalized maximum likelihood estimator.Recall that A l ⊂ [K] is the set of indices corresponding to the lth coarse category, and define a l as the cardinality of A l .We propose the multiresolution penalty Ω A given by where λ > 0 is a user-specified tuning parameter, w l > 0 is a user-specified weight for the lth coarse category, and ∥ • ∥ 2 is the Euclidean norm.For sufficiently large values of λ, this penalty will eventually require that for pair (j, l), β j,A l = c1 a l for scalar c ∈ R. As discussed in Section 2.1, β * j,A l = c1 a l would imply that the jth predictor is not useful for distinguishing between fine categories within the lth coarse category.Note that the outer summation in λΩ A (β) begins at j = 2 because we do not penalize the intercept by default.Adjusting our method to also penalize the intercept requires only trivial modifications.
Intuitively, Ω A functions like a group lasso penalty (Yuan and Lin, 2006), but rather than shrinking all coefficients in a group towards zero, it shrinks all coefficients towards a nearby constant.Because the c minimizing the (j, l)th term in the penalty is simply the average of Based on this characterization, it is clear that Ω A (β) is a seminorm and thus convex.For the remainder, we take w l = 1 though other choices of w l may be useful in certain applications.
While the penalty Ω A can exploit the special structure of β * described in Section 2, it will not perform variable selection (i.e., removal of irrelevant predictors).As mentioned in Section 2, for a predictor to be irrelevant to (1), it must be that all entries of the corresponding row of β * are equivalent.Thus, to achieve variable selection, a natural approach is to use a group lasso penalty on the rows of β.For large values of the tuning parameter corresponding to this penalty, this will require entire rows of β to be equal to zero.Beyond variable selection, the group lasso penalty also implicitly imposes the identifiability condition that each penalized row of β sum to zero (Molstad and Rothman, 2023).With the multiresolution penalty Ω A in hand, we propose to estimate β * using arg min where γ > 0 is an additional tuning parameter controlling the contribution of the group lasso penalty.For the remainder of the article, we will refer to the estimator defined in (4) as mrMLR, the multiresolution multinomial logistic regression estimator.

Existing methods
An enormous number of methods exist for cell type annotation, e.g., see the survey of Pasquini et al. (2021) and references therein.The methods most closely related to our own-categorized "supervised methods" in Pasquini et al. (2021)-are often based on nonparametric classifiers like random forests, k-nearest neighbors, and neural networks.These methods are not model-based and thus, fitted models are often difficult to interpret.In addition, these methods often fail to exploit the multiresolution structure of cell types.Methods that do take this structure into account often rely on fitting many conditional models (de Kanter et al., 2019;Bernstein et al., 2021)-as opposed to directly modeling (1)-which further complicates variable selection and predictor effect interpretation.Many penalized likelihood-based methods exist for fitting the multinomial logistic regression model with high-dimensional predictors.For example, Zhu and Hastie (2004) use a penalty on the squared Frobenius norm of the regression coefficient matrix, whereas Vincent and Hansen (2014) use a sparse group lasso penalty on the rows of β.Neither of these methods could exploit the special structure of the response discussed in Section 1.Other estimators assume β * has low rank (Yee and Hastie, 2003;Powers et al., 2018).Reduced-rank estimators of β * do not make explicit use of known multiresolution structures, nor can they easily be modified to perform variable selection.
Other work related to our own is that of Price et al. (2019) and Nibbering and Hastie (2022), both of whom fit (1) with a penalty on the Euclidean norm of β :,k −β :,k ′ for all k ̸ = k ′ .This encourages estimates of β * such that their kth and (k ′ )th columns are identical for some k ̸ = k ′ .This approach thus "combines" response categories as their estimated probabilities would always be identical if β * :,k = β * :,k ′ (i.e., one could never predict category k over category k ′ and vice versa).For cell type annotation, this is not practically useful.
In Web Appendices G.1 and G.2, we contrast our work to the related methods of Motwani et al. (2023) and Molstad and Rothman (2023).In G.3, we situate our estimator among existing methods for classification that exploit a known hierarchy between response categories, and among methods for effect aggregation in regression (Yan and Bien, 2021).

Algorithm overview
The optimization problem to compute our estimator, mrMLR, is convex.Moreover, the multinomial negative log-likelihood is differentiable and has Lipschitz continuous gradient.We can thus apply modern first-order methods-namely, variants of the proximal gradient descent algorithm-to compute mrMLR.The proximal gradient descent algorithm is an iterative procedure that defines its (t + 1)th iterate as the minimizer of a penalized quadratic approximation to the objective function constructed at the (t)th iterate (Polson et al., 2015, Section 3.1).Given step size τ > 0, this algorithm defines the (t + 1)th iterate of β, β (t+1) , as arg min For sufficiently small and fixed τ > 0, the sequence of iterates generated by (5) will converge to a solution to (4) as t → ∞ (Polson et al., 2015, Section 3.1).We present an accelerated version of this algorithm, which we implement, in Web Appendix C. Briefly, the accelerated version extrapolates a search point based on previous iterates and performs a backtracking line search to select the step size τ (Beck and Teboulle, 2009, Section 4).For simplicity, we discuss the algorithm based on iterate (5).The crux of this algorithm is computing (5) efficiently.Note that the solution to (5) can be obtained row-by-row of β, so we focus on computing the jth row.For j = 1, β where λ = τ λ, γ = τ γ, η ∈ R K is the jth row of β (t) − τ ∇G(β (t) ), and ν A l ∈ R a l is the subvector of ν whose components are indexed by A l for l ∈ [L].Hence, computing each iterate (5) requires solving (6).For that, we have the following lemma.
Lemma 1 The solution to (6), denoted νγ, λ, is given by where ν0, λ is defined as the solution to (6) with γ = 0, i.e., Based on the result of Lemma 1-a proof of which can be found in Web Appendix A-we see that if we compute (8), we can immediately obtain (6) based on the equality (7).Of course, (8) depends on the A l .If the A l are nonoverlapping (their intersection is empty), ( 8) can be solved in closed form.If any of the A l overlap, we must use an iterative procedure to compute (8).We will discuss these cases separately.Without loss of generality, we proceed as if every element of [K] belongs to at least one A l .For any k such that k ̸ ∈ ∪ L l=1 A l , the kth component of ν0, λ is the kth component of η.

Nonoverlapping coarse categories
In the situation that the A l are nonoverlapping, we have the following result which provides a closed form for (8).We prove this result in Web Appendix A.
Theorem 1 Suppose the A l are nonoverlapping.Let [ν 0, λ] A l ∈ R a l (resp.η A l ) denote the subvector of ν0, λ (resp.η) whose components are indexed by A l for l ∈ [L].It follows that the solution to (8), denoted ν0, λ, is defined subvector-by-subvector with Based on Theorem 1, we see that if λ is sufficiently large, [ν 0, λ] A l will simply be the average of η A l times the a l -dimensional vector of ones.When λ is sufficiently small, [ν 0, λ] A l will be a convex combination of η A l and the average of η A l times the a l -dimensional vector of ones.

Overlapping coarse categories
When the A l overlap, there is no closed form solution for (8) in general.Instead, we rely on an efficient iterative algorithm to solve the dual problem of (8).The dual problem for ( 8) is minimize where and [M (A l )] j,k = 0 for (j, k) ̸ ∈ A l × A l .Given a minimizer of ( 9), ζ, the solution to ( 8) is η − L l=1 M (A l ) ζ:,l .To solve (9) we use a blockwise coordinate descent algorithm (Algorithm 2 of Web Appendix C) inspired by Yan and Bien (2017), who studied variations of the overlapping group lasso.This algorithm cycles through each l ∈ [L], defining the (r + 1)th iterate of ζ :,l , ζ where Hence, we can quickly cycle through the ζ :,l in order to solve (9).Each optimization problem is only K-dimensional and we need only solve (8) for j ∈ {2, . . ., p} such that ∥[β (t) − τ ∇G(β (t) )] j,: ∥ 2 > γ, as it can be shown that if ∥[β (t) − τ ∇G(β (t) )] j,: ∥ 2 ≤ γ, then the corresponding νγ, λ = 0. Further refinements of the algorithm for solving (8) with overlapping A l are discussed in Web Appendix C.

Statistical properties
In this section, we establish an asymptotic error bound for our estimator.To simplify matters, we treat the x i as nonrandom and assume that the intercept is omitted (i.e., the first entries of the x i need not be one).Defining the norm ∥β∥ 1,2 = j ∥β j,: ∥ 2 , we study the estimator β = arg min β∈R p×J {G(β)+γ∥β∥ 1,2 +λ p j=1 L l=1 ∥D(a l )β j,A l ∥ 2 } for the remainder of this section.A proof of the ensuing results, which follow from arguments similar to those employed in Molstad and Rothman (2023), can be found in Web Appendix B.
Since for any β * ∈ R p×K , β * d = β * − d1 ⊤ K for any d ∈ R p will lead to the same probabilities (2) for all x ∈ R p , we must first define the version of the regression coefficient matrix from (2) for which our estimator will be consistent.For a given β * (i.e., any matrix which leads to the population level probabilities for all x ∈ R p ), our estimator is consistent for , the version of β * with row-wise average zero.See Web Appendix B for an explanation.
Next, we discuss the conditions and assumptions needed for our theoretical results.First, we require that X = (x 1 , . . ., x n ) ⊤ ∈ R n×p is appropriately scaled column-wise.
C1.The predictors are scaled so that ∥X :,j ∥ 2 2 ≤ n for all j ∈ [p].We will also require an assumption on the data generating process.

A1.
The response y i is a realization from the n i = 1 multinomial logistic regression model (1) with probabilities as defined in (2) for each i ∈ [n].
Let S ⊂ [p] be the set of predictors which are relevant (i.e., β † j,: ̸ = 0 for j ∈ S) and let S c = [p] \ S. By definition of β † , the kth predictor is irrelevant if β † k,: = 0. Similarly, for each A l , define S l = {k ∈ [p] : β † k,A l ̸ = c1 a l for any c ∈ R} as the set of predictors that distinguish between fine categories belonging to the lth coarse category.Naturally, S c l = [p]\S l is the set of predictors which do not distinguish between the fine categories in coarse category l (i.e., if k ∈ S c l , then β † k,A l = c1 a l for some c ∈ R).Define the collection of sets S = {S, S 1 , . . ., S L }.Our second assumption, which depends on C( S, ϕ)-a restricted subset of R p×K defined in Web Appendix B-is a restricted eigenvalue-type condition.

A2. For each (ϕ
The assumption A2 is analogous to the standard restricted eigenvalue condition in linear regression where G is squared error loss.Lastly, we define the subspace compatibility constant (Negahban et al., 2012) as Ψ A ( S) = sup ∆∈R p×K ,∥∆∥ F =1 L l=1 k∈S l ∥D(a l )∆ k,A l ∥ 2 .The compatibility constant, Ψ A ( S), quantifies the coherence between the penalty Ω A and the structure of β † .If few predictors distinguish fine categories within most coarse categories, Ψ A will be small.Conversely, if many predictors distinguish fine categories within many coarse categories, the penalty and the structure of β † do not cohere, so Ψ A will be large.More concretely, for any set of coarse categories one can show that Ψ A ( S) ≤ L l=1 |S l ||A l |.We are finally ready to state our main result.
Theorem 2 Let α ∈ (0, 1), ϕ 1 > 1, and ϕ 2 > 0 be fixed constants.Suppose C1, A1, and A2 hold.Define Of course, |S| is the number of relevant predictors, whereas the magnitude of Ψ A ( S) is controlled by the cardinality of the S l and A l .Our bound thus agrees with intuition: if many predictors are important for distinguishing between fine categories within many coarse categories, our penalty Ω A will hurt estimation by inappropriately biasing our estimator.
Conversely, if many of the S l have few elements, the bound can be improved by taking ϕ 2 large, which, loosely speaking, has the effect of reducing the volume of C( S, ϕ), and thus inflating κ( S, ϕ).

Simulation studies
To demonstrate the performance of our method, we compare it to multiple competitors under various data generating models and predictor dimensions.For 100 independent replications in each setting, we generate 500 training and validation samples, and 10 4 testing samples.
For each sample, we first generate x, a realization of N p (0, Σ) where Σ j,k = 0.7 |j−k| for (j, k) . Then, we generate the response, a K = 12 category single-trial multinomial random variable with probabilities π We set A 1 = {1, 2, 3}, A 2 = {4, 5, 6}, A 3 = {7, 8, 9}, and A 4 = {10, 11, 12}.We generate β * independently in each replication according to one of six models, Models 1-6.For each model, we first randomly select 18 predictors to be relevant, then randomly select s of those to be useful for distinguishing between coarse categories: the remaining 18 − s are useful for distinguishing between all fine categories.All other p − 18 predictors are irrelevant.Nonzero values of β * are drawn independently N(0, 5).For j ∈ {1, . . ., 6}, Model j sets s = 3(j − 1).Under Model 1, all predictors are useful for discriminating between fine categories.Under Model 6, only 3 of the 18 predictors are useful for discriminating between all fine categories.Models 2-5 are intermediate to Models 1 and 6 in terms of the number of the 18 relevant predictors which are useful for distinguishing all fine categories.We provide pseudocode for constructing β * in Web Appendix E.
We compare our estimator mrMLR, to multiple competitors.The first competitor, which we call Group, is a special case of ( 4) with λ = 0.By fixing λ = 0, Group can perform variable selection, but fails to exploit the multiresolution structure of the response categories.The second competitor, L1, is simply arg min β∈R p×K {G(β) + γ p j=2 K k=1 |β j,k |}.Unlike mrMLR and Group, L1 does not encourage variable selection directly.However, we include it because it is arguably more flexible than Group, and thus may be able to account for multiresolution structure of the response categories implicitly.
We also compare to a two-step approximation to mrMLR, Approx, which can take advantage of the multiresolution structure using multiple conditional models.We first describe this method in the context of nonoverlapping A l .In the first step, we construct surrogate responses ỹi ∈ {0, 1} L where ỹi,l = j∈A l 1(y i,j = 1) for i ∈ [n].Then, we fit (4) with λ = 0 to the regression of the ỹi , an L-category response, on the x i and select tuning parameters by cross-validation.In the second step, we fit L separate conditional logistic regression models where the lth model has a l response categories and uses only data for subjects where j∈A l 1(y i,j = 1) = 1 (i.e., the lth model conditions on ỹi,l = 1).Just as in the first step, each model is fit using (4) with λ = 0 and tuning parameters chosen by cross-validation.With this set of fitted models, the method Approx then predicts π where Pr 1 is an estimated probability from the model fit in the first step and Pr 2 is the estimated conditional probability from the model fit in the second step.Unlike Group and L1, this method both performs variable selection and can take advantage of the special structure of the responses.
In Figure 2, we display Hellinger distances over 100 independent replications under Models 1-6 with p ∈ {100, 200, 500, 1000}.Under Model 1, we see that Group and mrMLR perform almost identically.This agrees with intuition as all important predictors distinguish between all fine categories.Our method, mrMLR, can adapt to this situation whereas Approx, the only other method making use of the A l , performs very poorly.For Models 2-5, we see that Group gradually begins to perform worse relative to mrMLR.By Model 5 with p = 1000, Approx performs similarly to Group.Under Model 6, where the majority of predictors only distinguish between fine categories, mrMLR and Approx substantially outperform the other competitors.In Wed Appendix D, we also display classification errors, Kullback-Leibler divergences, as well results on support and effect resolution recovery.In brief, the same relative performances are observed under both classification errors and Kullback-Leibler divergences.Our method tends to recover both the set of important variables and their effect resolution of β * with reasonable accuracy, though this is affected by both p and the model.
Beyond the competitors discussed here, we also considered "vanilla" random forests, hierarchical random forests (Kaymaz et al., 2021), and the multiclass sparse discriminant analysis method of Mai et al. (2019).The method of Kaymaz et al. (2021) was specifically designed to exploit cell type hierarchies in the context of cell type annotation.This method uses the same known coarse categories as mrMLR.We display the classification accuracy of these methods versus those considered in Figure 2 in Web Figure 8.The method of Mai et al. (2019) outperforms the random forests, but none of the three methods outperform mrMLR.
In Web Table 3, we provide computing times for mrMLR.The average time to compute our estimator for a 100 × 10 grid of candidate tuning parameters (γ, λ) ranged from 6-80 minutes across the different models and predictor dimension considered in this section.We discuss simple ways to reduce computing times in Web Appendix D.4.
In Web Appendix D.5, we present comprehensive results for a separate simulation study with overlapping coarse categories.Relative performances are very similar to those under the settings considered in this section.
6 Application to interpretable cell type annotation

Overview
We apply our method to a dataset consisting of gene expression profiles for individual peripheral blood mononuclear cells from Hao et al. (2021).Immune cells within the set of peripheral blood mononuclear cells naturally exhibit a multiresolution structure, which we outline in Table 1.Our method is especially useful for this dataset because of the high degree of detail available in the K = 28 cell type labels.As shown in Table 1, there are L = 11 coarse categories we consider, many of which overlap.
Our goal is to fit a model which can be used to predict the type of a cell based on its gene expression profile (measured with RNA-seq).By interpreting the coefficients estimated by our method, we may also identify genes that distinguish groups of cell types at distinct resolutions.The complete dataset (after the quality control steps described in Web Appendix F.1) consists of more than 150000 cells from 8 individuals.Though protein expression data were available for these cells-and were used, in part, to assign cell type labels in Hao et al. (2021)-we focus on model building using only gene expression profiles because the majority of single-cell datasets do not have protein expression profiles available.Before model fitting, we normalize the gene expression profiles as described in Web Appendix F.1.

Comparison to competing methods
We first compare the predictive performance of our method to a subset of competitors from Section 5. To reduce complexity, we perform (unsupervised) predictor screening on the full dataset by ranking genes as described in Web Appendix F.1 and selecting the top p ∈ {500, 750, 1000, 1500, 2000} genes.For each replicate, we randomly split the dataset into training, validation, and testing sets by uniformly sampling without replacement n ∈ {10000, 20000, 30000, 40000, 50000} cells for the training set and 20000 cells each for the validation and testing sets.We choose tuning parameters to minimize the validation set deviance, and measure performance using the testing set deviance and classification error rate.Additionally, we also record the degrees of freedom, defined as the number of distinct fitted parameters in the model, in order to characterize model complexity.
We first assess the test set deviance and classification error rate when varying the number of genes p ∈ {500, 750, 1000, 1500, 2000} with the training set size n = 20000 fixed.We repeat each setting for a total of 50 independent replications, with the training, validation, and testing sets randomly sampled in each replicate as described above.Results for mrMLR and Group are presented in the top row of Figure 3.For both methods, the deviance and error rate decrease with an increasing number of genes considered in the model.We observe that mrMLR consistently performs better than the competitors, with the gap between mrMLR and Group increasing as more genes are considered after screening.As expected, the degrees of freedom increases with larger p, but more gradually for mrMLR.
Next, we repeat the same procedure for n ∈ {10000, 20000, 30000, 40000, 50000} with the number of genes p = 1000 fixed.Results are presented in the bottom row of Figure 3.For both methods, the deviance and classification error rate decrease with an increasing training set size, with mrMLR again consistently outperforming Group.With a training set size of greater than 30000 cells, the rate of increase of the number of distinct parameters tapers off with increasing training set size for mrMLR, but not for Group.
We also compared mrMLR and Group to alternative methods-namely that of Mai et al. (2019), random forests, and hierarchical random forests (Kaymaz et al., 2021).Like mrMLR, the method of Kaymaz et al. (2021) explicitly accounts for known cell type hierarchies: we provided the method of Kaymaz et al. (2021) the same hierarchy as mrMLR (Web Figure 15).We present results of this comparison in Web Appendix F.3.Briefly, both mrMLR and Group significantly outperform the competitors in terms of cell type classification accuracy.

Coherence with existing marker gene sets
We focus on interpreting our estimate of the regression coefficient matrix.We fit the model using mrMLR with p = 2000 and n = 20000 as described in the Section 6.2.In Figure 4, we display the estimated coefficients for genes which are most over-expressed in one cell type compared to the rest of the cell types combined (referred to as "marker genes"), as reported in Hao et al. (2021).That is, we display a submatrix of β corresponding to the genes which are expected to be nonzero based on existing research.In each row, we provide both the gene (e.g., in the first row, MS4A1) and the cell types in which this gene has been reported to be the most over-expressed gene (e.g., in the first row, MS4A1 is over-expressed in both B intermediate and B memory cells).A version of Figure 4 for the entire regression coefficient matrix estimate can be found in Web Figure 6.In Web Figure 13 and 14, we provide plots of the expression of GZMH, MS4A1, and XCL2-three genes in Figure 4-partitioned over cell types they are estimated to distinguish between.For the sake of example, we focus on the first three rows in Figure 4.These rows contain the estimated coefficients for genes previously reported to be over-expressed in subtypes of B cells.We see that the coefficients for these genes are distinct for B cell subtype categories (B intermediate, B memory, B naive, Plasmablast), but are the same within subtypes of monocytes, T cells, NK cells, and dendritic cells.Hence, we would interpret MS4A1, IGHM, and IGHA2 as useful for distinguishing between subtypes of B cells, and for discriminat-Figure 4: Submatrix of β described in Section 6.3.Grey entries are those with estimated coefficient values distinct from all others in their row, whereas entries of the same (nongrey) color have identical coefficient values within a row.For example, in the first row, the coefficients corresponding to B intermediate, B memory, B naive, and Plasmablast are all distinct, whereas the coefficients are the same within subtypes of monocytes, T cells, NK cells, and dendritic cells.Vertical axis labels are genes and in parentheses, the corresponding cell type for which that gene is a "marker gene" in Hao et al. (2021).Four fine cell types (Eryth, HSPC, ILC, Platelet) do not belong to any coarse category: their columns were omitted.
ing between B cells, monocytes, T cells, NK cells, and dendritic cells, but not useful for discriminating between subtypes of monocytes, T cells, NK cells, or dendritic cells.This interpretation coheres with the assignment of these genes as marker genes for B cell subtypes in Hao et al. (2021).We did not use this marker gene information in model fitting.
Some of our results are arguably more informative than the marker gene assignments from Hao et al. (2021).For example, Hao et al. (2021) label GZMH as a marker gene for CD4 + CTL, whereas our method estimates GZMH to distinguish between all T cell subtypes except subtypes of CD4 + naive and CD4 + memory T cells.Such a difference in interpretation follows from the fact that our model-based method accounts for the joint behavior of gene expression in distinguishing between cell types.The published marker genes, in contrast, were identified via univariate differential expression testing after cell type annotation.
To verify that our set of estimated marker genes is competitive with existing marker gene sets, we repeated the analyses from Section 6.2 wherein we used only the marker genes from Hao et al. (2021) as candidate predictors for both mrMLR and RF.Details are provided in Web Appendix F.3.The version of our method which selects genes from a large candidate set performs better than both the methods which use the marker genes from Hao et al. (2021) in terms of classification accuracy.These analyses demonstrate that our method could be used in conjunction with existing methods for identifying marker genes.For example, one could use the method of Dumitrascu et al. (2021) to select marker genes, then input only these genes into mrMLR with γ = 0 so that all marker genes are retained by the model.

Discussion
This work is focused on point estimation for (1) for the sake of classification and coefficient interpretation.A natural question is how to test the structures (sparsity pattern and effect resolution) discovered by our method.Fortunately, due to the relatively large sample sizes in modern single-cell RNA-seq datasets, one can perform likelihood-based inference in a straightforward way using sample splitting.This would proceed in three steps: (i) randomly partition cells into two sets: training and inference; (ii) using mrMLR, fit (1) to the training data, performing cross-validation to select tuning parameters; (iii) test the selected model's sparsity and effect resolution by fitting the constrained MLE on the inference set and applying standard inferential techniques on the fitted model.Although sample splitting-based inference can come at the cost of decreased power, in our real data analysis n is quite large (n > 150000), so we expect sample splitting to provide reasonable power.
Figure 1: (Left) A tree depicting the structure of the coarse categories A 1 , . . ., A 4 .Note that here, we define A k as the set of all fine categories (black leaf nodes) which have A k as an ancestor.(Right) A hypothetical regression coefficient submatrix where each cell represents a regression coefficient.Within each row, a distinct color represents a distinct numeric value.For example, the top row has four distinct coefficient values, and the coefficients corresponding to categories in A 4 are all identical.