Lymph node metastases in breast cancer: Investigating associations with tumor characteristics, molecular subtypes and polygenic risk score using a continuous growth model

We investigate the association between rate of breast cancer lymph node spread and grade, estrogen receptor (ER) status, progesteron receptor status, decision tree derived PAM50 molecular subtype and a polygenic risk score (PRS), using data on 10 950 women included from two different data sources. Lymph node spread was analyzed using a novel continuous tumor progression model that adjusts for tumor volume in a biologically motivated way and that incorporates covariates of interest. Grades 2 and 3 tumors, respectively, were associated with 1.63 and 2.17 times faster rates of lymph node spread than Grade 1 tumors (P < 10−16). ER/PR negative breast cancer was associated with a 1.25/1.19 times faster spread than ER/PR positive breast cancer, respectively (P = .0011 and .0012). Among the molecular subtypes luminal A, luminal B, Her2‐enriched and basal‐like, Her2‐enriched breast cancer was associated with 1.53 times faster spread than luminal A cancer (P = .00072). PRS was not associated with the rate of lymph node spread. Continuous growth models are useful for quantifying associations between lymph node spread and tumor characteristics. These may be useful for building realistic progression models for microsimulation studies used to design individualized screening programs.


| INTRODUCTION
Breast cancer is a heterogeneous disease. Different subtypes of breast cancer grow and spread at different rates, and they react differently to treatment. Recently, there has been an interest in statistically modeling breast cancer heterogeneity in terms of disease progression. 1,2 Number of lymph node metastases present at diagnosis is associated with long-term breast cancer prognosis. 3,4 It is therefore clinically relevant to understand breast cancer heterogeneity in terms of lymph node metastases at diagnosis. The purpose of this article is to investigate the association between breast cancer tumor characteristics, including molecular subtype, and rate of lymph node spread.
In 2018, the Cancer Intervention and Surveillance Network (CISNET), a consortium of research groups from six different universities, evaluated the contributions of screening and treatment to breast cancer mortality between 2000 and 2012 by molecular subtype, based on estrogen receptor (ER) status and human epidermal growth factor receptor 2 (HER2) status. 1 The group estimated that in 2012, compared to baseline mortality rates, total reduction in mortality rate from interventions was 49%, ranging from 37% for ERÀ/HER2À breast cancer to 58% for ER+/HER2+ breast cancer. The contributions of screening and treatment differed substantially between molecular subtypes. Screening was estimated to contribute to 31% of the total mortality reduction for ERÀ/HER2À and to 48% of the reduction for ER+/HER2+ breast cancer. Rueda et al 2 investigated the rate of recurring breast cancer by breast cancer molecular subtypes. They used a semi-Markov model; molecular information was based on PAM50 subtypes 5 and integrative subtypes. After surgery, state transition for local recurrence differed across the PAM50 molecular subtypes: Basal-like breast cancer predominantly recurred within the first 5 years, whereas luminal A breast cancer recurred almost uniformly throughout the 20-year study period. Some of these differences will be due to heterogeneity in rates of breast cancer spread.
Each group in CISNET has developed a breast cancer natural history model. 6 These models are all stage based and include a localized tumor stage, a regionally spread stage and distant metastatic stage. Of these approaches, the University of Wisconsin group uses, arguably, the most sophisticated stage model: a continuous time spread process based on Shwartz. 7 The model assumes that tumor volume follows an exponential Gompertz function with decelerating doubling time, individually assigned doubling times, and that the instantaneous rate of lymph node spread at time t is equal to λ(t) = b 1 + b 2 V(t) + b 3 V 0 (t), where V(t) is tumor volume at time t, V 0 (t) is the rate of growth at time t, and b 1 , b 2 and b 3 are constants. In Isheden et al, 8 it was shown that the model of Shwarz suffers from two weaknesses: firstly, it implies that slow growing tumors have a higher degree of lymph node spread compared to fast growing tumors, and, secondly, the model implies either an unrealistically high degree of lymph node spread for large tumors or an unrealistically low degree of lymph node spread for small tumors. Based on two independent data sets, it was shown that lymph node spread following an inhomogeneous Poisson process with rate λ(t) proportional to the number of times the tumor cells have divided, D(t), to the power four, and the rate of cell division in the tumor D 0 (t), that is, λ(t) = σD(t) 4 D 0 (t), combined with a gamma distributed random effect for individual spread σ, gives a significantly better model fit compared to the model of Shwartz,7 and the lymph node spread model of Hanin and Yakovlev. 9 Here, we base our analyzes on the lymph node metastases modeling approach of Isheden et al 8

and a
recent extension of the model to include a covariate effect on the rate of lymph node spread. 10 In this article, we use a natural history lymph node spread regression model to quantify the rate of lymph node spread based on grade, ER status, progesteron receptor (PR) status, molecular subtype and polygenic risk score (PRS).

| Data
We include two independent data sources for our study: the Cancer and Hormone Replacement Study, CAHRES; and breast cancer cases from the Stockholm-Gotland regional breast cancer register, here abbreviated as ST01-08. Ethical approvals were obtained for both data sources.
CAHRES is a case control study, consisting of all Swedish born women between the ages of 50 and 74, who were diagnosed with invasive breast cancer in Sweden from October 1993 to March 1995. The study had a participation rate of 84% (n = 3345), and patients were matched to randomly selected controls from the general population based on the expected age frequency distribution of the cases. For the purpose of our study, we use only the cases. Information on tumor size, degree of lymph node spread, grade, ER status and PR status was collected from the Swedish Cancer Registry and the Stockholm-Gotland Breast Cancer Registry. The collection of this data has been described previously by Rosenberg et al 11,12 and Eriksson et al. 13 Tumor size was categorized into millimeter diameter intervals, lymph node involvement categorized according to number of lymph nodes affected by metastases and grade categorized as 1, 2 or 3. Tumors were considered ER or PR positive if they contained at least 0.05 fmol receptor/μg DNA or at least 10 fmol receptor/mg protein. We excluded women if they did not provide written consent, had missing tumor size, missing lymph node status, had a tumor diameter larger than 80 mm or smaller than 1 mm or had more than 30 affected lymph nodes at diagnosis. The total number of women eligible for analysis based on these criteria was 2874, with 1928 having data on grade, 2082 on ER status and 2039 on PR status.
PRSs for a selection of women in CAHRES were available through an extension of the original study. 14 In this extension, 1500 women were randomly selected, together with all women who had taken hormone replacement therapy (191 cases) and all women with self-reported diabetes mellitus (110 cases). These women were contacted by mail and those who consented were given blood sampling kits to be used at their primary health care facility. From all deceased breast cancer cases, attempts were made to retrieve archived tissue samples. Blood samples were collected from 1322 cases and archived tissue was collected for 247 cases (85% of all selected). DNA was isolated from 3 mL of whole blood and from nonmalignant cells in the paraffin-embedded tissue samples. DNA samples were genotyped on a custom Illumina iSelect genotyping array (iCOGS). 15 ST01-08 consists of all women diagnosed with invasive breast cancer in Stockholm from 2001 to 2008. Women were identified through the Stockholm-Gotland Regional Breast cancer quality register, and information was collected on tumor size, lymph node involvement, grade, ER status and PR status. 16 Tumor size, lymph node and grade were categorized in the same way as in CAHRES. ER and PR status were determined using radioimmunoassay or immunohistochemistry (IHC) with cutoff values of more than 10% positive cells for IHC and more than 0 fmol/μg DNA for radioimmunoassay assays, and categorized as negative or positive. We excluded women if they had missing tumor size, missing lymph node status, a tumor diameter larger than 80 mm or smaller than 1 mm, or if they had more than 30 affected lymph nodes at diagnosis. This left a total of 8076 women eligible for analysis. Less than 2% of patients had missing data on tumor size and lymph node involvement. Twenty percent of patients had missing data for ER and PR status. Grade was included in the register from 2004, with 7% of patients having missing data. After exclusions, the final numbers of available women with data on grade, ER status and PR status were 5227, 6518 and 6385, respectively.
All women in the Stockholm-Gotland Regional Breast Cancer quality register still alive in 2009, diagnosed with invasive breast cancer between 2001 and 2008, and younger than age 80 at diagnosis were invited to participate in a study named Libro-1. Invitations were mailed out in 2009, and 62% (n = 5715) consented to take part in the study. These women gave blood specimens for genetic analysis. Of these, 5125 were successfully genotyped in a large-scale genotyping study on breast cancer risk. 17 Five thousand one hundred and twenty-two had enough remaining DNA for mutation testing using targeted sequencing. For the women in the Libro-1 study, data on molecular markers were retrieved in 2015 and 2016, from medical and pathology records at treating hospitals. From these, molecular subtype was assigned based on age at diagnosis, ER, PR, HER2 and Ki67 status using a random forest algorithm. 18 After applying the exclusion criteria, we were left with 1749 patients with data on molecular subtype. Our study was carried out with informed consent and ethical approvals from the Swedish ethical review board.

| Polygenic risk score
We constructed a PRS based on 158 single-nucleotide polymorphisms (SNPs) that were genotyped, or that could be imputed based on neighboring SNPs, in both studies. SNPs were chosen based on published studies on breast cancer risk. The PRS was constructed by summing the number of alleles of each SNP, weighted by per-allele odds ratios for breast cancer. Per allele odds ratios were taken from published studies, for example, Michailidou et al. 19 A PRS was thus calculated for each individual using the formula Where β k is the per-allele log odds ratio for breast cancer associated with the minor allele for SNP k, x k = 0, 1 or 2 is the number of minor alleles for the same SNP, and n = 158 is the total number of SNPs.
After exclusions based on tumor and lymph node data, the PRS could be calculated for 1119 of the available cases in CAHRES, and 4150 of the available cases in Libro-1/ST01-08.

| Statistical models
To Where s * is a gamma distributed random effect, D(t,r) is the number of times the cells in the tumor has divided and D 0 (t,r) is the rate of cell division in the tumor-both at time t, assuming an inverse growth rate r. In Isheden et al, 8 it was shown that these models lead to there being, at any time point, a negative binomial distribution, for the number of affected lymph nodes N, such that the probability of n affected lymph nodes, conditional on current tumor volume V, follows the functional form where V 0 is the minimal volume of a detectible lymph node metastasis (here assumed to be 0.5 mm), Γ Á ð Þ represents the gamma function, and γ 1 and γ 2 are the parameters of the gamma distributed random effect. The authors further showed 10 that the association between a covariate X and breast cancer lymph node spread can be modeled by assuming that the rate of lymph node spread in the underlying dynamic model of spread, during the pre-clinical phase, is amplified where β ¼ ðβ 1 ,β 2 , … β n Þ is a vector with the values of the covariate effects. When combined with the other assumed models, this leads to a negative binomial distribution, with number of affected lymph nodes N = n, given tumor volume V = v and covariate X, following Lymph node spread at diagnosis can also be affected by the tumor growth rate of the tumor. A faster growing tumor will result in a larger tumor volume at diagnosis, consequently leading to more lymph node spread at diagnosis. This can be accounted for by making a regression of tumor characteristics on the growth rate of the tumor, as was done in Isheden et al. 10 However, this requires screening data that we do not have for all of our study population. In our study, we therefore focus on the contribution to the rate of breast cancer lymph node spread. In the following sections, we use this likelihood to make inference on the effects of tumor characteristics on rates of breast cancer lymph node spread using data on tumor characteristics, tumor volume and number of lymph node metastases at diagnosis.
In addition to describing our approach in the Appendix S1, we also include a table (Table 1), summarizing the key characteristics/ assumptions of our approach and the data used to make inference.

| Point estimates and confidence intervals
Point estimates are calculated using maximum likelihood estimation, where the likelihood is based on the probability of N = n affected lymph nodes, conditional on the tumor volume V = v and the covariate of interest X = x: p n = P(N = njV = v, X = x). p n is calculated using Equation (5). 95% confidence intervals are estimated from 2000 bootstrap replicates using the percentile method.
We model the effect of grade on rate of lymph node spread in two different ways: firstly by modeling the effect of grade as an ordinal variable, so that the rate of lymph node spread is amplified by the factor e βg , where β is the log effect and g is the grade; and secondly, as a discrete variable, so that the rate of lymph node spread is amplified by the factor e β 2 g 2 þβ 3 g 3 , where Grade 1 is the reference, β 2 and β 3 correspond to the log effects of Grades 2 and 3, respectively, and g 2 , The rate ratio, that is, the ratio of the rate of lymph node spread (at all points in time during the cancer's preclinical phase) between two different tumors with different covariate levels X = x 1 and X = x 2 , assuming the same tumor volume V, inverse growth rate R and spread parameter s * , is calculated as

| Tumor size, number of affected lymph nodes and lymph node positivity
We first calculated the fractions of patients with tumor diameters less than 10, 10 to 19, 20 to 29, and more than 30 mm. In the CAHRES data set, these fractions were 19%, 45%, 22% and 14%, and in the ST01-08 data set, they were 18%, 46%, 23% and 13%. The percentage (unit) difference between the two data sets is less than 1% for all four size categories. In Figure 1, we show histograms of tumor diameters, divided into 10 mm intervals, for CAHRES and ST01-08. The fractions of patients with no affected lymph nodes, one affected lymph node, two affected lymph nodes and three or more affected lymph nodes were 68%, 12%, 6% and 14% in the CAHRES data set and 65%, 15%, 7% and 13% in the ST01-08 data set. In Figure 2, we show histograms of number of lymph nodes affected, from 0 to 10, for CAHRES and ST01-08. We next examined the proportion of patients with lymph node positive breast cancer. This was done for tumor size intervals 1 to 10, 11 to 20, 21 to 30, up to 71 to 80 mm for the CAHRES data, for ST01-08 and for the combined data. In Figure 3, these proportions are plotted for each data set as circles, with bootstrapped 95% confidence intervals intersecting each circle. Sopik and Norad 20 recently presented the distribution of number of affected lymph nodes using a large number of patients included in the Surveillance, Epidemiology and End Results (SEER) program database. We note that the pattern of association observed in that large study is very similar to that displayed in Figure 3.

| Association between lymph node spread and grade, hormone receptor status, molecular subtypes and PRS
In Table 3, we present point estimates and 95% confidence intervals from our analyzes of associations between rate of lymph node spread and grade, hormone receptor status, molecular subtype and PRS.
P-values based on the data sets with outliers removed are presented in Table 4.
Modeling the association between lymph node spread and grade   We note that, for both studies, information on grade, ER status and PR status was not as complete as it was for tumor size and lymph node status. While distributions of the latter two characteristics were similar for the two studies, the proportions of high grade and ERÀ tumors differed. All statistical analyses based on grade and ER status, conditions on these characteristics (grade and ER status are included as covariates), therefore non-random missingness on these characteristics will not introduce bias in the analyses presented here.
We did not find any convincing evidence that the PRS is associated with the rate of lymph node spread.
As an illustration of our results, we display graphically the   21 based on a mover-stayer model, which is an extension of the Markov chain, where the population of tumors is assumed to consist of two unobserved groups, a stayer group consisting of tumors with a zero probability of change and a mover group following an ordinary Markov process.
The strongest evidence of association was found between grade and rate of lymph node spread. Higher grade breast cancer is generally less differentiated, more invasive and more proliferative. We therefore expected higher rates of lymph node spread for higher grade tumors. In our data, Grade 1 tumor had least lymph node spread across all tumor sizes; see Figure 4. When modeling the association between grade and rate of lymph node spread, both the linear and discrete relationships were highly significant. All P-values were less than 5 Â 10 À10 , with the average rate ratio between a grade x tumor and a While Figure 3 partly captures the clinical significance of the rate ratio estimates (for grade), using our model, with sufficient information, it would also be possible to describe clinical significance of the rate ratio estimates in other, perhaps even more meaningful, ways.
Evaluating time to LN spread would, though, need to incorporate information on the relationship between growth rate of the primary tumor and the tumor characteristic (or eg, PRS). We do not have such information. However, if we did, then we could calculate, for example, expected times to first affected lymph node, for each tumor characteristic, and we could even study the impact of delayed detection for tumors with different characteristics. For a description of how the latter can be done, see Isheden et al. 10 Our study of genetic factors and rates of lymph node spread may suffer from survivorship bias. In Libro-1, blood samples were collected after 2008, and in CAHRES, they were collected after 1997. Some women had already died before blood samples were collected, which means that the association between genetic factors and rate of lymph node spread in our study may be underestimated. In any case, we found no significant association between our PRS and rate of lymph node spread. Using Libro-1 data, Li et al 25 did not find a significant relationship between PRS and lymph node status, or between PRS and survival. Furthermore, we are not aware of any study that has

| CONCLUSIONS
Survival benefits of screening and treatment vary across different breast cancer molecular subtypes. 1 Some cancers are more aggressive and some cancers are less aggressive. In part, this is reflected in a tumors propensity to spread to the lymph nodes, which in many cases is a precursory step of distant metastatic spread. In our current study, we have quantified tumor aggressiveness in terms of rate of lymph node spread based on genetic markers, tumor characteristics and for different molecular subtypes. Quantifying tumor aggressivity may prove useful in the future era of individualized treatment and screening.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author on request, after ethical approvals have been obtained from the Swedish ethical review board.