Residuals and diagnostics for multinomial regression models

In this paper, we extend the concept of a randomized quantile residual to multinomial regression models. Customary diagnostics for these models are limited because they involve difficult‐to‐interpret residuals and often focus on the fit of one category versus the rest. Our residuals account for associations between categories by using the squared Mahalanobis distances of the observed log‐odds relative to their fitted sampling distributions. Aside from sampling variation, these residuals are exactly normal when the data come from the fitted model. This motivates our use of the residuals to detect model misspecification and overdispersion, in addition to an overall goodness‐of‐fit Kolmogorov–Smirnov test. We illustrate the use of the residuals and diagnostics in both simulation and real data studies.


INTRODUCTION
Use of residuals for model diagnostics is common practice in regression analysis.While straightforward for linear models, visual inspection of residuals from categorical data models is challenging as their distributional properties based on asymptotic theory are often not met [16].
To address these limitations, researchers have developed residuals for binary, count, and ordinal data models that have similar visual and distributional properties as the linear model counterparts [5,13].Dunn and Smyth [5] defined their randomized quantile residuals using the fitted cumulative distribution function (cdf), F(y; μ, φ).For a discrete distribution, they use a uniform random sampling strategy that is mapped to the standard normal, r q,i = Φ −1 (u i ), , where a i = lim y→y i F , and Φ() is the standard normal cdf.Except for sampling variability in the estimated parameters, these randomized quantile residuals are exactly standard normal when the model is correct.
Dunn and Smyth [5,6] demonstrated the randomized quantile residual's ability to better illustrate mean misspecification and lack of fit via residual and quantilequantile plots in both logistic and log-linear regression settings.Feng et al. [7] demonstrated that for count regression models, using these residuals in a Shapiro-Wilk test setting controls type I error and has higher power for detecting model misspecification in comparison to the Pearson or deviance residuals.None of these papers discuss potential extensions to multinomial regression.
Liu and Zhang [13] introduced a similar form of residual called the surrogate residual, the diagnostic utility of which is subsequently established by Greenwell et al. [10].They apply the residual to both binary and ordinal categorical outcome regressions, but do not address modeling of nominal categorical count data.
Their derivation involves using a continuous variable S, which acts as a surrogate for the ordinal response y, with ordered categories 1 < 2 < • • • < J. Ordinal outcomes are generally modeled with the cumulative link model, based on a continuous cumulative distribution function F() and corresponding link function F −1 ().
Letting F(y; ) be the discrete cdf of response y, the form of the surrogate S involves uniformly sampling on the probability scale, The surrogate residual variable is then defined as In practice for case (y i , X i ), the conditional distribution S i |X i is estimated from a fitted model and the residual, rs,i , is based on random sampling from that distribution.The key distinction from the randomized quantile residual, r q , is that Liu and Zhang [13] proposed bootstrapping copies of the empirical distributions of r s .Greenwell et al. [10] went on to show, via implementation of the surrogate residuals in an R package, the ability to detect mean misspecification, heteroscedasticity, and lack of fit via diagnostic plots.
In this paper, we focus on grouped nominal multinomial outcome data, which are represented as vectors of counts.Common models used for analyzing such data include the standard multinomial logit (MLOGIT), and more flexible Dirichlet multinomial (DM) [19] and multinomial logistic normal (MLN) [1].Several variations of deviance-type residuals under the MLOGIT model have been proposed [17,18,23,26].However, these are defined on the J individual categories, producing J residuals per observation.They do not account for the association among category counts and share the visual assessment challenges of the other categorical model residuals.
We propose a method that utilizes log-transformed counts and the squared Mahalanobis distance measure to produce a single residual for each observation.Our proposed residual uses a similar inversion sampling method to Dunn and Smyth [5] and Liu and Zhang [13] based on the empirical cumulative distribution function of the squared Mahalanobis distances.As in the work of Dunn and Smyth [5], the resulting residuals should, under the correct model, follow a standard normal distribution.This allows for easily constructed and interpreted residual and quantile-quantile plots to assess model misspecification and fit.These residuals were briefly described and applied in a previous paper by [9], but are formalized here.
In the next section, we formally define the squared Mahalanobis distance (MD 2 ) residuals.In Section 3, we demonstrate the utility of these residuals in detecting misspecified mean and overdispersion and in testing goodness of fit.We then provide multiple real data examples in Section 4 and conclude with a brief discussion in Section 5.

MD RESIDUALS
Our interest is to produce residuals that provide useful evidence of model misspecification and general lack of fit for a nominal outcome regression model, where  J represents the multinomial distribution with J categories, exposure n i > 1, and fitted probability vector πi .Because these count vectors exist on the simplex S J , where common distance measures can be misleading [2], we transform each vector of counts to exist on R J−1 .We propose the additive log-ratio (alr) transformation [1] to retain the structural relationships between the categories of the count vector y, alr ( )) where y * i is the corrected count vector, described in Section 2.1, such that all y * i, > 0. A residual describes the distance of an observation from what is expected under a fitted model.To account for the associations between category counts, a natural choice for a distance measure of a vector on R J−1 is the squared Mahalanobis distance.The general form for the squared Mahalanobis distance [15] of a vector w is where M is the location (mean) vector and C is its covariance matrix.Since we do not know the fitted distribution of each transformed count vector, we propose constructing the empirical distributions of squared Mahalanobis distances and then use each distribution to assess how far the associated observed vector is from expected.Specifically, for each case i, we propose generating K samples of predicted counts y ik from the fitted multinomial model, applying the correction to each vector and then forming the sampling distribution of the additive log-ratio-transformed vectors, We then calculate the squared Mahalanobis distances of these model-generated log-odds, {w i }, as well as for the observed log-odds, w obs i , and where w i is the sample mean of model-generated log-odds and Σw i is their sample covariance matrix.Finally, we compute a percentile for the observed distance, MD 2(obs)   i relative to the empirical cdf of the model-generated distances, FK . This distance distribution is discrete so randomization is used to smooth away the discreteness via inversion sampling.In other words, we randomly generate a uniform random variable where the minimum, a i , and maximum, b i , depend on the observed value's ordered location among the model-based MD 2 i : ( ) . ( ) .
(3) Otherwise, where If the data fit the model, these percentiles are distributed  (0, 1).As in the work of Dunn and Smyth [5], we propose back-transforming these percentiles to standard normal values to serve as residuals, r i = Φ −1 (u i ).Normal quantile-quantile and residual plots are then used to assess fit.In the appendix, we illustrate that in the binomial setting, this approach reduces to the Dunn-Smyth approach except for variation introduced by the estimation of F.

Handling zero counts
In the presence of zero counts, the log-transformed counts will not exist.Thus, it is necessary to develop a framework for addressing their occurrence.When all n i = 1, model fit could be assessed via grouping techniques [11,14], but there are limited graphical assessment options.When all n i > 1, zero counts are still possible and a correction is needed.
We implement a correction proposed by Smithson and Verkuilen [25].The correction compresses data symmetrically, more heavily penalizing vectors with less information.For a general count vector y of length J and exposure n, the corrected vector is To illustrate, for J = 3 and a count vector y = (2, 0, 3), the corrected vector is y * = (1.93,0.3, 2.73).This correction is applied before the alr transformation to all observed and model-generated counts, thereby allowing residual calculation for any observation with a zero element or a fitted sampling distribution where zeros are non-ignorable.

DIAGNOSTICS
The residuals introduced in Section 2 take their ultimate form as standard normal values.Diagnostics proceed based upon assessment of constant variance across covariates and normality assumptions.To this end, normal quantile plots are used to assess the condition of normality and diagnose model fit.Residual plots of each covariate against the residuals are used to diagnose constant variance.We present scenarios where various multinomial regression models are compared under both the true model fit and misspecified model fits.We provide examples of the residuals' ability to detect misspecified mean structure and the presence of overdispersion and to distinguish between types of overdispersion.In Appendix B, we illustrate the MD 2 residuals utility under different settings for the exposure, n.

Detecting misspecified mean
Consider the MLOGIT model where the data are taken from the multinomial distribution y i ∼  J (n i ,  i ), and covariates are linked to the response through the probability vector, where alr −1 is the inverse additive log-ratio transformation, We simulate a dataset from this model of 1000 observations, each with a constant exposure of n = 200 and J = 3 categories.We consider a single covariate x i ∼  (0, 1) but include a quadratic term so the covariate vector is . The parameters for the simulated dataset are .
To fit the model, we use R software, version 4.1.2,and the MGLMreg function from the MGLM package [21,28].The MGLMreg function can fit the MLOGIT, DM, generalized DM, and negative multinomial models.
Sampling distributions for each observation are generated using the rmultinom function in R, producing K = 10000 samples.The squared Mahalanobis distance residuals are then calculated as described in Section 2.
Figures 1 and 2 illustrate the fit of the residuals for the MLOGIT models with and without the quadratic term, respectively.The figures show the MD 2 residual's ability to identify misspecification of the mean when the quadratic term is not included in the model fit.The incorrect model suggests distances that are on average farther away than expected (i.e., the mean of the residuals is not 0).

Detecting overdispersion
Occasionally, data may arise that exhibit additional variability about the underlying probabilities.One model for such data is the multinomial logistic normal distribution, where a multivariate normal noise term is added to the log-odds so that the probability vector varies by case, MLN data are simulated in the same fashion as those data from the MLOGIT model, except for the removal of the quadratic term and the addition of the noise term.With J = 3, noise is simulated from a bivariate normal distribution, with an identity covariance matrix,  i ∼  2 (0, I 2 ).
The data are fitted first to the MLOGIT model using the MGLMreg function and then used to produce K = 10000 draws for each observation.To fit the MLN model, we employ Bayesian inference under non-informative priors via a Metropolis-within-Gibbs sampler algorithm (see the work of Gerber and Craig [9] for more details).Using the maximum a posteriori (MAP) estimates of the parameters, the same number of samples, K = 10000, are generated for each observation.
Figures 3 and 4 present the QQ plots and residual plots for each situation.The additional variability present in the observed data is not accounted for in the MLOGIT model, and the lack of fit is quite evident.Figure 4 illustrates the MLN model's ability to adequately consider this additional variability.
Another more traditionally studied form of overdispersion in multinomial data is the DM model where the underlying probabilities are assumed to follow the Dirichlet distribution [19].The key distinction between the MLN and DM models is that the MLN model allows for positive correlations between category counts.To demonstrate that our residuals can also detect poor covariance structure fit, we simulate an MLN dataset with positive correlations and fit both the MLN and DM models.The data are simulated the same way, but for a change in the noise covariance matrix: ] .
This covariance matrix corresponds to an implied correlation structure between the count categories ( Y ) that is approximately .12 −.74 1 Thus, there is a small positive correlation between the first and third categories.
As expected, the MD 2 residuals are able to distinguish between the two different types of overdispersion (Figures 5 and 6).Smaller than expected residuals generally indicate that the model is overestimating the amount of variability in the data.While it is difficult to directly compare the variability implied by the DM model and that implied by the MLN model, the residuals illustrate the DM model is inadequate for modeling overdispersion in the presence of positive correlations between categories.

Testing goodness of fit
The diagnostic plots in the previous sections illustrate how the MD 2 residuals graphically assess model fit.It is also desirable to provide an overall goodness-of-fit measure based on the residuals.For this, a test for normality may be used.We use a variation of the Kolmogorov-Smirnov test that quantifies the distance of the empirical distribution of the residuals to the standard normal distribution and provides a p-value that serves as a proxy for determining if the model is misfit.
To illustrate the use of the Kolmogorov-Smirnov test to assess overall model fit, 100 datasets each of m = 100,250, 500,750, 1000 observations under the MLOGIT and MLN models were simulated, using the same settings as in Section 3. Using  = 0.05, the percent of simulations that did not suggest a lack of fit are presented in Table 1.
Data were also simulated under the DM model and the MLN model with positive correlation (MLN+), and the results of the Kolmogorov-Smirnov tests are presented in Table 2.
For the first two scenarios (Table 1), the test easily identifies mean misspecification and overdispersion.In scenarios with a more subtle quadratic trend or less noise (i.e., less overdispersion), the incorrect model is not rejected in some simulations, especially at lower sample sizes.This trend is present in the MLN+ data simulations (Table 2), where the incorrect DM model is found to fit reasonably in simulations when m is small.For the DM data simulations, because the MLN is the most flexible model, it does a good job describing the data for almost all sample sizes.
The Kolmogorov-Smirnov test is only one possible choice for a test of normality for these residuals.Other tests, such as the Shapiro-Wilk or Anderson-Darling, are other options to be investigated.These are generally considered more powerful tests although the results here show this use of the Kolmogorov-Smirnov test to be fairly powerful.

Forest pollen
The forest pollen dataset is available in the MM package in R [3].These data are well-established as overdispersed multinomial data and have been modeled extensively with the DM distribution [19,22].They consist of m = 76 fossil pollen counts of J = 4 different arbor types (pine, fir, oak, and alder) from the Valley of Mexico.Each observation has exposure n = 100.We fit two intercept-only models, the MLOGIT (model A) and the DM (model B), with probability vectors, .
Since these data have overdispersion, we expect the DM model to fit better.After fitting each model using the MGLMreg function and then producing empirical sampling distributions of size K = 10000, the MD 2 residuals are calculated, diagnostic plots are created, and the Kolmogorov-Smirnov tests are conducted.
The diagnostic plots (Figures 7 and 8

MLB Batting Outcomes
The second example uses offensive baseball outcome counts of Major League Baseball (MLB) batters, collected from the Lahman database [8].The data consist of m = 1000 random player-seasons between 1961 and 2020 with at least 200 at-bats.The offensive outcomes considered are those referred to as the "three true outcomes" as well as all other outcomes in a fourth category: (home runs, walks, strikeouts, and others) so that J = 4.The first three outcomes are those that do not usually involve the defense, thereby only depending on the pitcher and batter.The exposures range between 200 ≤ n i ≤ 695.
It has previously been shown that batting outcomes in baseball not only are overdispersed, but also that the three true outcomes are positively correlated [9,20].We would expect that the MLN model should fit these data better than the DM due to the presence of positive correlation.Thus, we again consider two intercept-only models, the DM (model A) and MLN (model B) with probability vectors, , It is clear that there is overdispersion present in these data since both models seem to fit reasonably well.It is difficult to immediately distinguish which type of overdispersion is appropriate to model based on the plots (Figures 9 and 10) although careful examination of the 95% confidence envelopes suggests that the MLN is slightly better.The Kolmogorov-Smirnov test statistics and p-values

Bird Species Counts
Lastly, we investigate the use of covariates to predict species counts of birds at Saxapahaw (NC) observatory.These data are available from the North American Breeding Bird Survey [24], with weather data taken from the Weather History Dashboard of visualcrossing.com[27].
There are m = 22 years of species counts of various birds, covering 1998 to 2019.The survey window ran from May 22 through June 30 each year so the the weather data correspond to these dates.We chose the J = 5 swallows in the dataset: the field, grasshopper, house, song, and chipping.
Each year the total number of swallows ranges between 16 ≤ n i ≤ 46.Species counts are often modeled via multinomial models [4], including data from the North American Breeding Bird Survey itself in the context of estimating observer effects on the counts [12].
Two models are fit, a null intercept MLOGIT model and a MLOGIT model including four covariates (maximum temperature ( • F), number of rainy days, maximum sustained wind (MPH), and the interaction between minimum temperature and number of rainy days) so the probability vectors are The plots (Figure 11) and test (test statistic D A = .292,p-value = 0.037) for the null intercept model indicate some model misfit, with only m = 22 observations, however, it is difficult to assess the type of misfit.The introduction of covariates improves the model fit (Figure 12) and produces a different result from the Kolmogorov-Smirnov test (test statistic D B = .140,p-value = 0.730).Based on the MGLMreg output and Wald tests, the interaction between minimum temperature and number of rainy days is significant, and the inclusion of maximum temperature and maximum sustained wind helps to lower Bayesian information criterion (BIC) and Akaike information criterion (AIC) model selection criteria.

DISCUSSION
The squared Mahalanobis residuals introduced here provide a convenient graphical option for diagnosing fit for nominal multinomial count regression models.The ease of interpretation of the resulting plots is far more accessible than the traditionally used Pearson or deviance residual plots.We show the squared Mahalanobis distance residuals' ability to capture mean misspecification and to diagnose and distinguish between types of overdispersion in a multinomial regression setting.Additionally, Kolmogorov-Smirnov tests may be used as an option for assessing overall model fit based on these residuals.
The results indicate the test sufficiently controls type I error rate and has strong power in detecting the correct model fit.Another benefit of these residuals is their computational simplicity, in which after the model fitting, the  distribution of the response that exists in R, they may be used in other difficult-to-diagnose multivariate regression settings, including continuous and copula-based models.
These residuals serve as another tool to add to the process of diagnosing the fit of multi-class regression models.Additional work is needed to investigate the performance of these residuals under different settings of observations, number of categories, and exposure.Given the decreased utility under smaller exposure settings, grouping of cases could be explored.A more robust study of the Kolmogorov-Smirnov test performance should be conducted.Work could also be done to determine the most efficient value of K, although the cost of this simulation is very low.Finally, the residual being based on a squared distance complicates precise deduction of model misfit from the diagnostic plots, as the left "tail" of the residual distribution represents those observations that deviate very little from the expected.A more rigorous study to understand patterns of different types of misfit is thus called for that will likely include investigations of the fit to each category.
Since there may be zero elements, the Smithson and Verkuilen correction (Section 2.1) is used.Then, the "observed" logits (logit (obs) i ) are calculated the same way using the observed y i .To calculate the MD residual, The percentile of the observed MD obs i is then calculated using the empirical cumulative distribution of K model-generated Mahalanobis distances, FK (MD i ), so that the limits of the uniform random variable u are as follows: (1) When MD (obs) i ≤ min(MD i ), ( (3) Otherwise, We find a near one-to-one correspondence in the bounds of the uniform residuals between the Dunn-Smyth residuals and ours (Figure A1).The limits will not perfectly match because of the additional noise introduced by estimating the empirical distribution function.However, the agreement is indicative of the similarity between the two approaches and helps motivate the use of the Mahalanobis distances when J > 2.

APPENDIX B B.1 Simulation under varying exposure, n > 1
The simulation examples in Section 3 simplified matters by keeping n i constant across all observations.In practice, there are many applications where n i varies; when modeling the human gut microbiome, bacterial counts may be in the thousands, while species counts in ecological studies may produce much smaller exposures.In settings with smaller exposures, there is higher occurrence of zero counts, causing a greater reliance on the correction described in Section 2.1.Here, we illustrate the continued  The only difference in these simulations is the exposure, which instead of fixing to n i = 200 for all observations i, we allow to vary discretely uniform, n i ∼  {2,400}.The MD 2 residual normal probability QQ plots for the four fits are shown in Figure B1, and we illustrate the continued utility of the residuals for assessing model fit when the exposure varies over a relatively large range, including smaller values.

B.1.2. Varying exposure, 2 ≤ n i ≤ 30
We also simulate and fit data under the same settings while restricting the total exposure for each count to be relatively small, such as one might see in species count applications.We should expect the residuals to have a more difficult time in distinguishing the models due to a greater reliance on the zero-count correction.In our simulated data with exposure n i ∼  {2, 30}, while the differences in QQ plots (Figure B2) for the correct and incorrect models are less extreme, it is still a noticeable difference, indicating that the residuals are still useful in settings with low exposure.
U R E 1 (A) MD 2 residual normal probability QQ plot and (B) covariate plot for the correctly specified model fit (with quadratic term) to multinomial logit data.MD 2 residual normal probability QQ plot and (B) covariate plot for the misspecified model fit (lacking quadratic term) to multinomial logit data.
) clearly show that the MD 2 residuals favor the DM.This is supported by the Kolmogorov-Smirnov test results where the MLOGIT (test statistic: D A = 0.334, p-value < 0.001) indicates model misfit, while the DM (test statistic: D B = 0.107, p-value = 0.344) does not.(A) (B) F I G U R E 3 (A) MD 2 residual normal probability QQ plot and (B) covariate plot for the misspecified model fit (multinomial logit model) to multinomial logistic normal data.MD 2 residual normal probability QQ plot and (B) covariate plot for the correctly specified model fit to multinomial logistic normal data.
U R E 5 (A) MD 2 residual normal probability QQ plot and (B) covariate plot for the correctly specified model fit to Dirichlet multinomial data.MD 2 residual normal probability QQ plot and (B) covariate plot for the misspecified model fit (Dirichlet multinomial model) to multinomial logistic normal data.
MD 2 residual normal probability QQ plot and (B) covariate plot for the multinomial logit fit to forest pollen data [19].(A) (B) F I G U R E 8 (A) MD 2 residual normal probability QQ plot and (B) covariate plot for the Dirichlet multinomial fit to forest pollen data [19].(D A = 0.068 and p < 0.001, D B = 0.026 and p = 0.524) corroborate the choice of the MLN model as most appropriate.
MD 2 residual normal probability QQ plot and (B) covariate plot for the Dirichlet multinomial fit to MLB batting outcome data.(A) (B) F I G U R E 10 (A) MD 2 residual normal probability QQ plot and (B) covariate plot for the multinomial logistic normal fit to MLB batting outcome data.
MD 2 residual normal probability QQ plot and (B) covariate plot for the null multinomial logit model (intercept only) fit to bird species count data.U R E 12 (A) MD 2 residual normal probability QQ plot and (B) covariate plot for the multinomial logit model fit with covariates to bird species count data.TA B L E 1Percentage of four model fits, using datasets of varying size m simulated under multinomial logit (MLOGIT) and multinomial logistic normal (MLN) models, which at  = 0.05 in the Kolmogorov-Smirnov test showed evidence that the residuals follow a standard normal distribution, and thus, the model is a good fit.

F
I G U R E A1 Correspondence in (a) lower and (b) upper limits of uniform sampling distribution for Dunn-Smyth randomized quantile residuals and MD 2 residuals in logistic regression framework.F I G U R E B1 MD 2 residual normal probability QQ plots for four model fits under widely varying exposure, 2 ≤ n i ≤ 400.(a) A correctly fit multinomial logit with quadratic term, (b) an incorrectly fit multinomial logit without quadratic term, (c) a correctly fit multinomial logistic normal with additional variability, and (d) an incorrectly fit multinomial logistic normal without additional variability.

FB. 1 . 1 .
I G U R E B2 MD 2 residual normal probability QQ plots for four model fits under low exposure, 2 ≤ n i ≤ 30.(a) A correctly fit multinomial logit with quadratic term, (b) an incorrectly fit multinomial logit without quadratic term, (c) a correctly fit multinomial logistic normal with additional variability, and (d) an incorrectly fit multinomial logistic normal without additional variability.utility of the MD 2 residuals in the simulated setting where n i varies.Varying exposure, 2 ≤ n i ≤ 400 We simulate and fit data under the same parameter settings as in Sections 3.1 and 3.2, corresponding to four model fits: (A) a correctly fit MLOGIT (with quadratic term), (B) an incorrectly fit MLOGIT (without quadratic term), (C) a correctly fit MLN (with overdispersion), and (D) an incorrectly fit MLN (without overdispersion).
Percentage of four model fits, using datasets of varying size m simulated under Dirichlet multinomial (DM) and multinomial logistic normal with positive correlation (MLN+) models, which at  = 0.05 in the Kolmogorov-Smirnov test showed evidence that the residuals follow a standard normal distribution, and thus, the model is a good fit.
TA B L E 2