Misuse of multinomial logistic regression in stroke related health research: A systematic review of methodology

Multinomial logistic regression (MLR) is often used to model the association between a nominal outcome variable and one or more covariates. The results of MLR are interpreted as relative risk ratios (RRR) and warrant a more coherent interpretation than ordinary logistic regression. Some authors compare the results of MLR to ordinal logistic regression (OLR), irrespective of the fact that these estimate different quantities. We aim to investigate the time trends in the use and misuse of MLR in studies including stroke patients, specifically the extent to which (1) the results are denoted as anything other than RRR, (2) comparisons are made of results with results of OLR and (3) results have been interpreted coherently. Secondarily, we examine the use of model validation techniques in studies with predictive aims. We searched EMBASE and PubMed for articles using MLR on populations of stroke patients. Identified studies were screened, and information pertaining to our aims was extracted. A total of 285 articles were identified through a systematic literature search, and 68 of these were included in the review. Of these, 60 articles (88%) did not denote exponentiated coefficients of MLR as relative risk ratios but rather some other measure. Additionally, 63 articles (93%) interpreted the results of MLR in a non‐coherent manner. Two articles attempted to compare MLR results with those of OLR. Nine studies attempted to use MLR for predictive means, and three used relevant validation techniques. From these findings, it is clear that the interpretation of MLR is often suboptimal.

However, if evidence-based guidelines are to improve patients' outcomes, it is crucial that the statistical and epidemiological methodologies in such research are applied properly and that results are interpreted correctly. Previous reviews investigating the methodological conduct of prognosis studies conclude that the research quality within some medical fields is generally poor (Collins et al., 2011, 2014; Mallett et al., 2010). In the field of stroke research, outcomes are sometimes nominal or ordinal. For such outcome types, multinomial logistic regression (MLR) may be used, and ordinal logistic regression (OLR) may be used when the outcome is ordinal. MLR is a standard statistical model for outcomes of nominal type and is useful for prediction, where the goal is prognosis or diagnosis (Covert et al., 2020; Uijl et al., 2020), for estimating adjusted predicted probabilities (risks) (Wolfe et al., 1999) and for discrete choice models (Agresti, 2013); it can also be used to derive inverse probability of treatment weights (IPTWs) with multiple exposure groups (Boesgaard Graversen et al., 2021). However, when MLR is used to estimate the effect of an exposure, the effect estimates require careful interpretation, as the model parameters represent log relative risk ratios (RRR). Relative risk ratios have a more complex interpretation than odds ratios (ORs) because they are a doubly relative measure, though ORs themselves are not as easily interpretable as risk ratios (RR). While some sources reflect the correct use and interpretation (Duchon, 2014), critical assessment of study results and conclusions is difficult due to the complexity of the model. This leaves uncertainty as to how often MLR is used with a flawed interpretation of the coefficient estimates.
Moreover, healthcare professionals may have uncritically applied study results in clinical guidelines, which potentially leads to treatment choices and patient diagnoses supported by flawed conclusions from research using MLR.
An alternative to MLR is the so-called OLR model, which can be used when the outcome variable is ordinal and can assume more than two values (Kleinbaum & Klein, 2010). OLR can be used to estimate ORs and relies on the so-called proportional odds assumption (Gelman & Hill, 2006; Kleinbaum & Klein, 2010), an assumption not used for MLR. Another misuse of MLR occurs when it is used to validate the proportional odds assumption (Topriceanu et al., 2021). Furthermore, some authors compare the results of MLR and OLR, irrespective of the fact that these estimate different quantities.
In the current study, we seek to understand the extent of misuse when MLR is applied in comparative studies in stroke-related medical sciences. Hence, we aim to investigate the time trends in the use and misuse of MLR in studies that include patients with stroke. Specifically, we aim to assess to what extent the included studies:
1. Denote the exponentiated coefficients as RRR or equivalent.
2. Compare the results of MLR to those of OLR.
3. Interpret the results of the MLR model coherently, such that the RRRs for the same exposure groups are interpreted together.
Additionally, we are interested in the use of MLR for prediction and examine whether studies with a predictive aim use any form of data split, external validation or similar to assess the models' predictive ability (Moons et al., 2009; Steyerberg & Vergouwe, 2014).

| BACKGROUND
Logistic regression models have been widely used in the medical sciences and still are. Using logistic regression, it is possible to model the effect of one or more binary and continuous explanatory variables on a binary outcome variable. Logistic regression offers ease of interpretation of the estimated effect, as both risk and ORs can be derived from the estimated coefficients. Logistic regression can be extended to the case where the outcome variable is nominal with more than two levels. This model is called the MLR model, and the extension results in a more complex model that has a wider application, but this comes at the expense of interpretability when comparing outcomes between exposure groups. The MLR model can, under some assumptions, be used to estimate the adjusted predicted probabilities of outcomes in observational studies where covariate adjustment is necessary; however, the model coefficients do not necessarily quantify the effect of an exposure on the outcome.
Suppose Y is a nominal outcome variable with k levels, coded 0, 1, …, k − 1. The MLR model can then be expressed as

$$\log\frac{P(Y = i \mid x_1, \ldots, x_m)}{P(Y = 0 \mid x_1, \ldots, x_m)} = \beta_{i0} + \beta_{i1} x_1 + \cdots + \beta_{im} x_m, \qquad i = 1, \ldots, k - 1,$$

where $x_1, \ldots, x_m$ are explanatory variables (Gelman & Hill, 2006; Kleinbaum & Klein, 2010). Note that coefficients are specific to the i-th outcome level, and thus (k − 1) different sets of m + 1 regression coefficients are estimated in the model. When using MLR, one must specify a reference level, which is Y = 0 above, but this choice is entirely arbitrary. In practice, any category may be chosen as the reference category, and the coefficients change accordingly.
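As a minimal sketch of the model structure described above, the following Python snippet turns one set of coefficients per non-reference outcome level into outcome probabilities. The function name `mlr_probabilities` and all coefficient values are hypothetical and chosen only for illustration; they are not estimates from any study in this review.

```python
import math

def mlr_probabilities(coefs, x):
    """Return [P(Y=0), ..., P(Y=k-1)] for the covariate vector x.

    coefs[i-1] holds (beta_i0, beta_i1, ..., beta_im) for outcome level i >= 1.
    Level 0 is the reference level; its coefficients are implicitly zero.
    """
    # Linear predictor of each non-reference level: log[P(Y=i) / P(Y=0)]
    etas = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in coefs]
    denom = 1.0 + sum(math.exp(eta) for eta in etas)
    return [1.0 / denom] + [math.exp(eta) / denom for eta in etas]

# Hypothetical coefficients: k = 3 outcome levels, m = 1 binary covariate
coefs = [(-1.0, 0.5), (-2.0, 1.2)]
p_unexposed = mlr_probabilities(coefs, [0])
p_exposed = mlr_probabilities(coefs, [1])
print(p_unexposed, p_exposed)
```

Under these assumptions, the probabilities in each group sum to one, and the log of P(Y = i)/P(Y = 0) equals the linear predictor for level i.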

| Interpretation of the coefficients
In the commonly used logistic regression model, where the outcome variable is binary, the exponentiated coefficients are easily interpreted as ORs. For MLR, this interpretation has been reused in some studies (Matsukawa et al., 2013; Rist et al., 2013; Zinkstok et al., 2014). However, the exponentiated coefficients cannot be interpreted as ordinary ORs but should rather be interpreted as a ratio of relative risks. This can be seen in the case where $m = 1$, because, for a one-unit difference in $x_1$,

$$\frac{P(Y = i \mid x_1 = 1)\,/\,P(Y = 0 \mid x_1 = 1)}{P(Y = i \mid x_1 = 0)\,/\,P(Y = 0 \mid x_1 = 0)} = \exp(\beta_{i1}),$$

where the left-hand side is a ratio of the relative risks of being in the outcome category i versus in the reference category. As such, the exponentiated coefficients should be interpreted and referred to as RRRs. Denoting exponentiated coefficients as either OR, risk ratios (RR) or any other measure is therefore misleading and may lead to a faulty inference of effects. We investigate the extent of this denotation error in answering our first question. Additionally, the complexity of MLR raises another issue connected to interpretation. An RRR above one may be interpreted as an increased risk in the exposure group compared with the unexposed; however, an RRR above one may equally be caused by a reduced risk in the reference group as a result of an increased risk in one of the competing outcome groups. Most likely, the cause of the RRR above one is a combination of both scenarios, which warrants a coherent interpretation of the RRRs of all outcome categories; that is, all coefficients of a variable must be interpreted together. The complexity of interpretation increases with the number of outcome groups, and even an RRR equal to one cannot uniquely be interpreted as equal risk between exposure groups. An example of one of these scenarios can be seen in Figure 1, where the risk of Y = 1 is the same in each exposure group but the RRR is above one.
This is a result of a reduction in risk for Y = 0 and an increase in risk for Y = 2, respectively, between exposure groups, and thus any interpretation of the RRR for Y = 1 versus Y = 0 must be coherent with the interpretation of the RRR for Y = 2 versus Y = 0. The histogram can be extended to arbitrarily many outcome levels, and interpretation would likewise be more complex. Moreover, any coherent interpretation of the results of MLR must recognize the doubly relative nature of RRR. In the case where Y is binary, MLR reduces to ordinary (binary) logistic regression.
We investigate the coherency of interpretation in answering our third question.
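The scenario described for Figure 1 can be reproduced numerically. The sketch below uses hypothetical risks (not the values behind Figure 1) in which the risk of Y = 1 is identical in both exposure groups, and computes the RRRs with Y = 0 as the reference. The helper name `relative_risk_ratios` is an assumption made for this illustration.

```python
def relative_risk_ratios(risk_unexposed, risk_exposed, reference=0):
    """RRR of each non-reference outcome level versus the reference level."""
    rrrs = {}
    for level in range(len(risk_unexposed)):
        if level == reference:
            continue
        # Relative risk of `level` versus the reference, within each group
        rel_exposed = risk_exposed[level] / risk_exposed[reference]
        rel_unexposed = risk_unexposed[level] / risk_unexposed[reference]
        rrrs[level] = rel_exposed / rel_unexposed
    return rrrs

# Hypothetical risks of Y = 0, 1, 2 in each exposure group
risk_unexposed = [0.5, 0.3, 0.2]
risk_exposed = [0.3, 0.3, 0.4]

rrrs = relative_risk_ratios(risk_unexposed, risk_exposed)
risk_ratio_y1 = risk_exposed[1] / risk_unexposed[1]
print(rrrs, risk_ratio_y1)
```

With these numbers, the risk ratio for Y = 1 is exactly one, yet the RRR for Y = 1 versus Y = 0 is about 1.67, purely because risk has shifted from Y = 0 to Y = 2 (whose RRR is about 3.33). Reading the RRR for Y = 1 in isolation would therefore suggest an effect that is not present in the risks.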

| Predicted probabilities
MLR may be used to derive both adjusted predicted probabilities and conditional predicted probabilities. While conditional predicted probabilities are useful for estimating risk differences within certain strata, adjusted predicted probabilities are useful for estimating marginal risks for each exposure group and outcome class. Deriving predicted probabilities without adjusting for any covariates is equivalent to making a J × K contingency table with J outcome classes and K exposure groups. We have deemed the use of MLR for estimating adjusted/conditional predicted probabilities legitimate but do not directly investigate the extent of this usage.
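One common way to obtain adjusted predicted probabilities is to fix the exposure at a given level for every individual, predict the outcome probabilities using each individual's own covariates, and average the predictions. The sketch below illustrates this under stated assumptions: the coefficients, the covariate values and the function names are all hypothetical, and the averaging approach is one possible choice rather than the method of any particular study.

```python
import math

def predict(coefs, x):
    """MLR outcome probabilities; level 0 is the reference level."""
    etas = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in coefs]
    denom = 1.0 + sum(math.exp(eta) for eta in etas)
    return [1.0 / denom] + [math.exp(eta) / denom for eta in etas]

def adjusted_probabilities(coefs, covariates, exposure_level):
    """Average predicted probabilities with the exposure fixed for everyone."""
    totals = [0.0] * (len(coefs) + 1)
    for z in covariates:
        probs = predict(coefs, [exposure_level] + list(z))
        totals = [t + p for t, p in zip(totals, probs)]
    return [t / len(covariates) for t in totals]

# Hypothetical coefficients per outcome level: (intercept, exposure, covariate)
coefs = [(-1.0, 0.4, 0.3), (-2.0, 0.9, 0.6)]
# Hypothetical covariate values, one tuple per individual
covariates = [(0.0,), (0.5,), (1.0,), (2.0,)]

adj_unexposed = adjusted_probabilities(coefs, covariates, 0)
adj_exposed = adjusted_probabilities(coefs, covariates, 1)
risk_differences = [e - u for e, u in zip(adj_exposed, adj_unexposed)]
print(adj_unexposed, adj_exposed, risk_differences)
```

The resulting probabilities are on the risk scale, so differences or ratios between exposure groups can be read directly for every outcome class, which avoids the doubly relative interpretation of the coefficients.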

| OLR
FIGURE 1 This stacked histogram shows a scenario where the overall risk of Y = 1 is the same in both the unexposed and exposed groups ($x_1$ denotes the exposure group). Interpreting the exponentiated coefficients for $x_1$ separately would lead researchers to infer an increased risk for Y = 1 in the exposed group, even though the risk is the same in both groups. Interpreting coefficients for $x_1$ for both Y = 1 and Y = 2 relative to the reference group together, one might be able to deduce that the decrease in RRR for Y = 1 is caused by an increased risk for Y = 2.
When the outcome variable is ordinal, instead of MLR, one might consider the OLR model, from which ORs can be estimated. OLR can be used when an outcome variable is ordinal and has more than two levels. Suppose Y is an ordinal outcome variable with k levels. OLR can be expressed as:

$$\log\frac{P(Y \le i \mid x_1, \ldots, x_m)}{P(Y > i \mid x_1, \ldots, x_m)} = \alpha_i + \beta_1 x_1 + \cdots + \beta_m x_m,$$

where i is some value of Y we will call a cutoff value and $x_1, \ldots, x_m$ are explanatory variables (Kleinbaum & Klein, 2010). Contrary to MLR, OLR estimates the association between covariates and the log odds of Y being less than or equal to some cutoff i. Furthermore, note that the coefficients, except the intercept, are not dependent on the cutoff being considered. As such, the coefficients express a change in log odds for a unit change in a given covariate, irrespective of the cutoff value being considered, implying the assumption that this change is the same for each cutoff value. This assumption is called the proportional odds assumption, and it is central to OLR. Inspection of the proportional odds assumption can be done in various ways, including comparison of ORs for all possible cut-points and a Brant test (Brant, 1990). Inspection of the proportional odds assumption has also been attempted by comparing the ORs of the OLR with the RRR of the MLR (Topriceanu et al., 2021). However, such comparisons lead to faulty conclusions, as comparing coefficients that are not representations of the same measure of association is not meaningful.
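To make the proportional odds assumption concrete, the following sketch computes outcome probabilities from a proportional odds model and confirms that the cumulative OR for the exposure is the same at every cutoff. It assumes the parameterization described above (log odds of Y being less than or equal to the cutoff, with cutoff-specific intercepts and shared slopes); the intercepts, the slope and the function names are hypothetical.

```python
import math

def olr_probabilities(cutpoints, betas, x):
    """Outcome probabilities under a proportional odds model.

    cutpoints[i] is the intercept for the log odds of Y <= i (increasing in i);
    betas are shared across all cutoffs.
    """
    shift = sum(b * xj for b, xj in zip(betas, x))
    # Cumulative probabilities P(Y <= i); the last level always has P = 1
    cumulative = [1.0 / (1.0 + math.exp(-(a + shift))) for a in cutpoints] + [1.0]
    return [cumulative[0]] + [
        cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))
    ]

def cumulative_odds(probs, cutoff):
    """Odds of Y <= cutoff given the outcome probabilities."""
    p = sum(probs[: cutoff + 1])
    return p / (1.0 - p)

cutpoints = [-0.5, 1.0]  # hypothetical intercepts for Y <= 0 and Y <= 1
betas = [-0.7]           # hypothetical log OR for a binary exposure

p_unexposed = olr_probabilities(cutpoints, betas, [0])
p_exposed = olr_probabilities(cutpoints, betas, [1])
odds_ratios = [
    cumulative_odds(p_exposed, c) / cumulative_odds(p_unexposed, c)
    for c in range(len(cutpoints))
]
print(odds_ratios)
```

Both cumulative ORs equal exp(-0.7) up to floating-point error, which is what the assumption asserts. In observed data, cutoff-specific ORs that differ markedly would cast doubt on the assumption.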
An alternative to OLR is its generalization, the generalized OLR (GOLR) model. This model does not rely on the proportional odds assumption and thus estimates multiple sets of coefficients, similar to MLR, though a comparison to MLR is still ill-advised. However, the increased complexity of GOLR complicates the interpretation of inferred effects.

| Illustrative cases
To further illustrate an important point of this review, namely, that the parameters of MLR and OLR cannot meaningfully be compared, we propose two examples. The first example uses registry data from the Danish National Patient Registries (usage approved by the Danish Data Protection Agency), and the second example involves a hypothetical situation. The contingency table for both examples is presented in Table 1. Analyses are performed in R (R Core Team, 2018).
Case 1: Association between mental disorders and stroke or death.
For the first example, we consider a cohort of 22,719 patients who have turned 60 years of age on or prior to 2009 and who have been diagnosed with atrial fibrillation (AF) (ICD-10: I48) at some point after turning 60. We then compare patients who have been diagnosed with severe mental disorders (ICD-10: F20, F22, F25, F30 or F31) with those who have no history of mental disorders on death and stroke (ICD-10: I60, I61, I63 or I64) within 2 years after they have been diagnosed with AF.
The proportions of death and stroke are summarized for this cohort using a contingency death. However, these results arise because of the larger group of patients that remain stroke-free and survive compared with the reference group.
In this example, the outcome variable is the patients' health state, which can assume progressively worse outcome classes. As such, it can be considered as an ordinal variable, and we may model the associations between the outcome and mental health disorders with an OLR model from the ordinal package in R. Using the OLR model yields a single parameter estimate for the exposure, which is assumed to be equivalent for all cutoffs of the outcome variable (Table 1). This is often called the proportional odds assumption. If, instead, we use a GOLR model, also from the ordinal package in R, which does not assume proportional odds, we can estimate the same number of parameters as with MLR ( and 'dead', respectively. These estimates are identical to those of the GOLR model, with some rounding error. From the results of the GOLR, we see that the proportional odds assumption is likely to hold in this example, while the results of the MLR preserve the direction but not the magnitude. None of the analyses captures the elevated risk of stroke for patients with severe mental illness. This is only seen by considering the contingency table.
Case 2: A hypothetical example.
In the second example, we see an elevated risk of cancer and a decreased risk of death in a group of smokers compared with non-smokers. Again, the RRR estimates of the MLR do not capture the decreased risk of death because both RRRs are above 1.0 (Table 1). Moreover, the example reveals that the RRR estimates of the MLR are not comparable to the OR estimates of the GOLR and OLR because the estimates differ in both magnitude and direction. Both examples highlight the difficulties in interpreting RRR. However, we stress that the MLR can be used to derive an adjusted contingency table using predicted probabilities.
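A situation of the kind described in Case 2 can be sketched with a small contingency table. The counts below are invented for this illustration and are not the values in Table 1. With a single binary exposure and no other covariates, both MLR and GOLR should reproduce the observed table, so the RRRs and the cutoff-specific cumulative ORs can be computed directly from the counts. The function names are hypothetical.

```python
def rrr(counts_unexposed, counts_exposed, level, reference=0):
    """RRR for `level` versus the reference outcome, exposed versus unexposed."""
    rel_exposed = counts_exposed[level] / counts_exposed[reference]
    rel_unexposed = counts_unexposed[level] / counts_unexposed[reference]
    return rel_exposed / rel_unexposed

def cumulative_or(counts_unexposed, counts_exposed, cutoff):
    """OR for Y <= cutoff, exposed versus unexposed."""
    def odds(counts):
        at_or_below = sum(counts[: cutoff + 1])
        return at_or_below / (sum(counts) - at_or_below)
    return odds(counts_exposed) / odds(counts_unexposed)

# Hypothetical counts ordered as (healthy, cancer, dead)
non_smokers = [500, 100, 400]
smokers = [300, 350, 350]

risk_dead_non_smokers = non_smokers[2] / sum(non_smokers)
risk_dead_smokers = smokers[2] / sum(smokers)
rrr_cancer = rrr(non_smokers, smokers, level=1)
rrr_dead = rrr(non_smokers, smokers, level=2)
or_healthy = cumulative_or(non_smokers, smokers, cutoff=0)
or_healthy_or_cancer = cumulative_or(non_smokers, smokers, cutoff=1)
print(rrr_cancer, rrr_dead, or_healthy, or_healthy_or_cancer)
```

In this invented table, the risk of death is lower among smokers (0.35 versus 0.40), yet both RRRs exceed one because the share of healthy individuals, the reference outcome, falls even more. The two cumulative ORs lie on opposite sides of one, so neither set of estimates can be read as a check on the other.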

| METHODS
We identified articles concerning the use of MLR in analysis, with stroke patients as the population, through a systematic literature search. The databases PubMed and EMBASE were initially searched on 29 August 2019, and a final search was conducted on 29 September 2021, using the search string described in Appendix A. Articles were included regardless of publication year.

| Inclusion/exclusion criteria
Initially, we included all articles that applied MLR to a population that included stroke patients as either exposure or outcome. Of these, studies that aimed at investigating or in any way reflected on the associations represented by the RRRs of MLR were subsequently included. Thus, prediction (prognosis or diagnosis) studies where the parameters of the MLR are not referred to as associations between an independent variable and the risk/odds of a response were excluded. Likewise, studies where the MLR was only used for IPTW were excluded. Studies deriving only adjusted/unadjusted predicted probabilities where no comparisons between exposure groups are made were also excluded. Studies in languages other than Danish or English and studies with an unclear methodology were also excluded.

| Data extraction, analysis, and reporting
The titles and abstracts of all articles identified by the search string (Appendix A) were screened by two of the authors (LRR and AHS) to exclude articles not pertaining to multinomial logistic models or to populations that include stroke patients. Studies not excluded after the screening of titles and abstracts were screened again on a full-text basis by the two other authors (JBV, NF). Lastly, some articles were identified by hand. These were screened in the same manner as the articles found in the systematic search (Figure 2).
We have reported our systematic review in accordance with the PRISMA 2020 guidelines, except for the items relating to meta-analysis, which are not relevant in the context of this review.
The included articles were carefully examined for information about whether they:
1. Denoted exponentiated coefficients as anything other than RRR.
2. Compared MLR results to results of OLR.
3. Interpreted MLR results in a coherent manner.
Additionally, studies with predictive aims were reviewed to ascertain whether any form of data split or similar was used to assess the models' predictive ability. Using a random data split is in general ill-advised, as this method is inefficient and inferior to cross-validation and bootstrapping methods. We do not consider the use of apparent validation, that is, where a prediction model is validated on the same data as it has been trained, as a valid validation method. Furthermore, the publication year for each article was determined in order to assess the time trends in the use of MLR. The information extracted from each article is presented in Appendix B.
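For readers unfamiliar with the distinction between apparent validation and resampling-based validation, the sketch below outlines k-fold cross-validation in plain Python. It is an illustration only: the outcome labels are invented, and the "model" is a deliberately trivial majority-class predictor standing in for a fitted MLR model. All function names are hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; each index is in exactly one test fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def cross_validated_score(data, fit, score, k=5, seed=0):
    """Mean score over k folds; the model never sees its own test fold."""
    scores = []
    for train, test in k_fold_indices(len(data), k, seed):
        model = fit([data[j] for j in train])
        scores.append(score(model, [data[j] for j in test]))
    return sum(scores) / len(scores)

def fit_majority(train_outcomes):
    """Placeholder model: always predict the most frequent training outcome."""
    return Counter(train_outcomes).most_common(1)[0][0]

def accuracy(predicted_class, test_outcomes):
    return sum(y == predicted_class for y in test_outcomes) / len(test_outcomes)

# Invented nominal outcomes for 20 hypothetical patients
outcomes = ["home"] * 12 + ["rehab"] * 5 + ["dead"] * 3

cv_accuracy = cross_validated_score(outcomes, fit_majority, accuracy, k=5)
apparent_accuracy = accuracy(fit_majority(outcomes), outcomes)
print(cv_accuracy, apparent_accuracy)
```

The apparent accuracy is computed on the same data used for fitting, which is the practice we do not count as validation. The cross-validated figure averages performance over held-out folds while still using every observation once for testing.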

| RESULTS
The search string resulted in 227 articles in PubMed and 267 articles in EMBASE. After removing duplicates, our search yielded 285 articles (Figure 2).
Following the screening of titles and abstracts, 52 articles were included for full-text screening. A further 28 articles were found by hand-searching reference lists.
In total, 68 studies met our inclusion criteria and were eligible for inclusion in the review. The included studies were published between 1988 and 2021, and 58 studies (85%) were published between January 2010 and September 2021 (Figure 3).

| Denoting the coefficients
As suspected, coefficients of MLR were seldom referred to as RRRs, with 60 articles (88%) referring to them as some other measure of risk (Table 2). OR interpretation was by far the most popular way to interpret coefficients, with 50 articles (74%) interpreting the exponentiated coefficients as ORs.
Two articles by the same author used MLR but recognized the difficult interpretation of MLR and subsequently abandoned interpretation in favour of deriving adjusted predicted probabilities (Buntin et al., 2005, 2009). As such, these articles did not denote the exponentiated coefficients as either RRs, ORs or RRRs, although it should be noted that the results of the MLR were reported in supplements, which seem to be unavailable.
FIGURE 2 Flow diagram of the screening process for studies identified in the systematic literature search.
FIGURE 3 This graph shows the publication year of the included studies. As can be seen on the graph, the use of multinomial logistic regression (MLR) seems to be more frequent starting from 2012, with a large spike in usage in recent years.
TABLE 2 Proportion of articles identified for each objective.

| Comparisons with OLR
Out of the 68 included articles, only two articles (3%) compare results from MLR to OLR with the purpose of validating the proportional odds assumption (Johnson et al., 2015;Lipsman et al., 2009).

| Coherency of interpretation
As seen in Figure 1 and earlier discussions, it is important to interpret the coefficients in a coherent manner and not interpret each coefficient independently. Of the 68 included articles, only five studies (7%) interpreted the coefficients coherently (Buntin et al., 2005, 2009; Hendrickx et al., 2021; Macintosh et al., 2021; Williams et al., 2021). Additionally, three studies (4%) did not explicitly interpret MLR results for unspecified reasons, although the studies aimed to do so (Lipsman et al., 2009; Nagai et al., 2011; Uijl et al., 2020). Studies varied in the degree of coherency in their interpretation. As mentioned previously, two studies abandoned all interpretation of results in recognition of the difficult interpretation of MLR (Buntin et al., 2005, 2009). We counted these as having a coherent interpretation. Three studies are identified as having a coherent interpretation to some extent, as they recognized the reference outcome category (Hendrickx et al., 2021; Macintosh et al., 2021; Williams et al., 2021).

| Use of data split in studies with a predictive aim
As the secondary aim of this study, we examined the use of data splits or similar methods in studies with predictive aims. Among the included studies, nine aimed to use MLR for predictive purposes. Two studies applied data splits or other methods, where Covert et al. (2020) used cross-validation and Uijl et al. (2020) validated their model using an external dataset. The remaining studies applied Akaike's information criterion (Hamaguchi et al., 2020).

| DISCUSSION
In this review, we have formulated three problem areas regarding the use of MLR based on inadequacies seen previously in studies in this area. Investigation of these problems in studies involving stroke patients identified in a systematic literature search shows that some of these problems are common when studies use MLR to compute associations (Table 2).
The denotation of exponentiated coefficients from MLR in studies involving stroke patients as anything other than RRRs appears to be exceedingly common. We speculate that much of this misinterpretation is a legacy of the usage of binary logistic regression because the clear majority of studies denoted the results as OR. This is unfortunate, as the wrongful denotation of the exponentiated coefficients may lead to a wrongful interpretation of their meaning. Specifically, denoting the exponentiated coefficients as OR may lead to interpretations such as an increase or decrease in risk or odds, which we have shown is erroneous. The wrongful denotation of the results of MLR speaks to a misunderstanding of the model itself. Authors should be aware of the results of their model of choice, specifically what measures of effect these can be meaningfully interpreted as. As such, authors must be aware that MLR estimates RRRs and also be aware of their dually relative nature.
We specified a second problem area pertaining to comparisons with OLR because examples were presented where the results of MLR were compared with those of OLR. In the included studies involving stroke patients, this problem was not common, with two studies (Table 2) making this comparison. As previously discussed, making this comparison is not meaningful, and most authors did not attempt to do so.
As our third objective, we set out to investigate how the results of MLR were interpreted. We have described how the use of MLR necessitates a coherent interpretation of the estimated coefficients, as the method is significantly more complex than, for example, binary logistic regression. Coefficients are specific to the outcome category, resulting in multiple estimated coefficients for each covariate in the model. Each coefficient is therefore influenced by the coefficients of other outcome levels, and an interpretation taking this into account is necessary. Only five studies displayed coherency in interpretation, with the clear majority of studies interpreting the coefficients separately. This is problematic as single estimates provide limited information about risks, and thus these are given too much significance in many studies. In a worst-case scenario, this may lead to the wrong conclusions being drawn.
Two studies (Buntin et al., 2005, 2009) included in our review did not display any of the previously described problems. Both studies displayed an understanding of the complexities of MLR and, consequently, abandoned the interpretation of the estimated coefficients. Instead, the authors opted to derive adjusted predicted probabilities, which can be done from the results of MLR (Greene, 2012). The fact that only two studies did not display the problems we have described is not surprising, as authors familiar with the challenges of MLR would be expected to use other models. In contrast, one study (Soroush et al., 2019) interpreted the results of MLR only through p-values and made no attempt at interpreting the effect estimates or the relevant confidence intervals. Consequently, this study employs the least coherent interpretation seen in the studies included in our review.
It is not clear how these results relate to existing research; to our knowledge, no other study has examined the extent of the previously described challenges with MLR.
None of the studies identified in this review use MLR to derive conditional predicted probabilities or discrete choice models.
The secondary aim of our review was to examine the use of data splits or similar methods to validate MLR models used for prediction. In this review, only a few studies (n = 9) used MLR for prediction, with only three studies (33%) validating the model using data split, external validation, cross-validation or similar methods. This proportion is lower than what has been found in other reviews (Bouwmeester et al., 2012; Collins et al., 2011, 2013, 2014), though methodology among these varies. One review (Collins et al., 2014) excludes studies that do not conduct external validation of prediction models, and it is evident from the flow diagram that 43 articles are excluded because they use random data splits, cross-validation or bootstrapping. Disregarding studies exclusively performing external validation of existing models and counting excluded studies using internal validation, the proportion of studies using data splits, external validation or similar would be more than half of the included articles. Similarly, a review regarding the methodology and reporting in the development of prediction models for type 2 diabetes (Collins et al., 2011) found that 38% of the included studies used internal validation of the model and 54% used external validation. Note that this review did not include studies performing external validation of an already existing model. A similar review of studies of chronic kidney disease (Collins et al., 2013) found that, of the 11 included studies, 73% used internal validation by data splits or similar methods, and additionally, 36% of studies used external validation of the model. This review also did not include studies that performed external validation of an already existing model.
Lastly, a review examining the reporting and methodology of studies in clinical prediction research (Bouwmeester et al., 2012), which included both model development studies and predictor-finding studies, found that the use of external validation was 27% and 0%, respectively. Likewise, the use of internal validation was infrequent, but in total, the proportion of studies that used data splits, external validation or similar methods was larger than seen in our review.
The low proportion of studies using model validation methods seen in our review may be explained by multiple factors. Firstly, only a small number of predictive studies were identified by our review. Secondly, the use of data splitting techniques and external validations seems to vary between diseases (Collins et al., 2011, 2013). Lastly, the inclusion of either model development studies and/or studies identifying predictors seems to influence the findings (Collins et al., 2011).

| Limitations
Our systematic search and review were limited to articles in either English or Danish and only considered studies that were peer-reviewed. Additionally, we did not have the opportunity to do a full-text search for all articles, and it is likely that a large portion of studies that apply MLR do not state so in the title or abstract. As such, it cannot be ruled out that studies could have been missed. We expect that if these articles had been identified, the problems raised in this review might be more prevalent. Therefore, we may have underestimated the extent of these methodological problems.
This review detailed studies involving stroke patients as either a population or an exposure. It is unclear if the extent of the problems we have discussed in this review is similar when MLR is used on other patient populations, but we suspect this to be the case.

| CONCLUSIONS
This systematic review identifies 68 articles selected for inclusion. Based on the three questions that we sought to answer, we have found significant problems in the usage and interpretation of MLR. Study authors routinely mislabel and fail to coherently interpret the effects inferred by MLR.
Consequently, the risk of misinterpreting the results from applying MLR and drawing conclusions that the data does not support is high.
Contrastingly, the use of comparisons of MLR and OLR results is not widespread.
Our findings show that the use of MLR to estimate and interpret associations is difficult, and based on the results of this review, authors should be careful when interpreting the results of MLR. Considering the year of publication of the included studies, it is evident that most studies identified in this review have been published in recent years and that the use of MLR seems to be increasing (Figure 3). Thus, the challenges we have identified are increasingly problematic and relevant.
It is important to state that while MLR may necessitate an overly complex interpretation, the model still has merit when used for prognostic or diagnostic purposes, performing IPTW in data with a nominal exposure variable, using discrete choice models or computing adjusted predicted probabilities.

AUTHOR CONTRIBUTIONS
Line Ryberg Rasmussen and Amalie Helme Simoni conducted the systematic literature search and screened the articles based on abstracts and titles. Nicholas Fitzhugh and Jan Brink Valentin conducted the full-text screening and data extraction. Nicholas Fitzhugh drafted the manuscript, and Jan Brink Valentin contributed with inputs. All authors subsequently reviewed the draft and provided feedback.

ACKNOWLEDGEMENTS
Not applicable.

CONFLICT OF INTEREST STATEMENT
The authors declare that they have no competing interests.

DATA AVAILABILITY STATEMENT
The data supporting the conclusions is available in the article.