We analyzed 32 years of data on individuals' characteristics obtained from the Framingham Heart Study (FHS), configured as a social network. To ascertain the network ties, we computerized information from archived, handwritten documents that had not previously been used for research purposes: the administrative tracking sheets that FHS personnel have used, and archived, when calling participants to arrange their periodic health exams ever since the Offspring cohort of the FHS began in 1971.
The tracking sheets were used as a way of optimizing participant follow-up by asking participants to identify people close to them. But they also implicitly contain valuable social network information. These sheets recorded the answers when all 5124 members of the Offspring cohort were asked to identify friends, neighbors (based on address), coworkers (based on place of employment), and relatives who might be in a position to know where they (the egos) would be in two to four years. The key fact here that makes these administrative records so valuable for social network research is that, given the compact nature of the Framingham population in the period from 1971 to 2003, many of the nominated contacts were themselves also participants of one or another FHS cohort.
We have used these tracking sheets to develop friendship links for FHS Offspring participants to other participants in any of the four ongoing FHS cohorts; in addition to the Offspring cohort, these cohorts included the Original cohort, the Generation 3 cohort, and the OMNI cohort (for details, see Christakis and Fowler [9] and their Supplementary Appendix). The tracking sheets allow us to know which participants nominated or were nominated by others as a close friend at each exam [12–17]. The status of friendship between two people (a dyad) is determined by whether each party identifies the other as a close friend, yielding four possible dyadic states (null, directional in either direction, and mutual). Named close friends can be in any of the four FHS cohorts.
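The four dyadic states can be illustrated with a minimal Python sketch. The function name `dyad_state` and the state labels are hypothetical choices made for this illustration; only the classification rule comes from the text.

```python
def dyad_state(y_ij: int, y_ji: int) -> str:
    """Classify a dyad from its two directed nominations (1 = named, 0 = not named)."""
    if y_ij and y_ji:
        return "mutual"
    if y_ij:
        return "i->j"   # directional: i names j only
    if y_ji:
        return "j->i"   # directional: j names i only
    return "null"

# Enumerate all combinations of the two directed ties.
states = [dyad_state(a, b) for a in (0, 1) for b in (0, 1)]
```
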
The FHS close friend network is the network analyzed in this paper. Of the individuals named by Offspring cohort members as close friends, 55 per cent are also participants in the FHS, with the majority of these (68.5 per cent) in the Offspring or subsequent cohorts. The total number of dyads that have non-null status in any exam is 2572; these involve 3754 unique actors.
The Offspring cohort includes seven waves of health exams conducted approximately four years apart and such that the waves do not overlap (the time period of the exams was centered in 1973, 1981, 1985, 1989, 1992, 1997, and 1999). At each exam, a detailed medical assessment is performed yielding an extensive array of personal characteristics and medical information, including the individuals' height and weight, smoking status, blood pressure, evaluation of depression, and (at some exams) girth measurements (e.g. waist, hip, and arm girth), education, and handedness. We have also been able to correctly assign addresses to virtually all subjects at all the waves they came in for examination. We can thus compute distances between individuals [18].
2.2. Key variables
The dependent variable, denoted Yijt, is the binary indicator of whether a tie exists from actor i to actor j at exam t. The predictors fall into three categories: network variables (e.g. whether or not a tie in the reverse direction exists), health traits (e.g. BMI, smoking), and other covariates (e.g. characteristics of the actors).
We include Yjit as a predictor to control for reciprocity (also known as mutuality), that is, the effect of j naming i as a friend on the likelihood of i naming j as a friend. In this longitudinal setting, we are interested in whether being reciprocally named as a friend affects the likelihood of an existing tie dissolving or of a new tie forming.
The mutable and highly observable health traits of interest are: BMI, body proportion, muscularity, and smoking. Of these, muscularity is likely the least observable trait. Mutable but less observable traits are depression score (a scale variable about which a binary-valued clinical measure of depression is often defined) and blood pressure (a continuous measure). The immutable but more observable traits considered are height and personality type. The immutable and less observable traits are birth order, being an only child, and handedness.
Body proportion is defined herein as height/(waist girth) [19], muscularity as (arm girth)/(waist girth), birth order as (nsiblings − rorder)/nsiblings, where nsiblings is the number of siblings including oneself and rorder is the rank from oldest (1) to youngest (nsiblings) of an individual's age among their siblings, and personality type as the binary indicator of type A personality.
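The derived traits defined above can be written as a short Python sketch. The function names are hypothetical, and the sketch assumes that height and the girths are recorded in the same units.

```python
def body_proportion(height: float, waist_girth: float) -> float:
    # Height divided by waist girth (both assumed to be in the same units).
    return height / waist_girth

def muscularity(arm_girth: float, waist_girth: float) -> float:
    # Arm girth divided by waist girth.
    return arm_girth / waist_girth

def birth_order(n_siblings: int, r_order: int) -> float:
    # n_siblings counts the individual; r_order is 1 for the oldest sibling.
    return (n_siblings - r_order) / n_siblings

oldest_of_four = birth_order(4, 1)
youngest_of_four = birth_order(4, 4)
only_child = birth_order(1, 1)
```

Under this definition the youngest sibling and an only child both score 0, while the oldest of many siblings approaches 1.
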
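For continuously valued predictors (BMI, body proportion, muscularity, depression, blood pressure, height, and birth order), the dissimilarity of a variable X between individuals i and j at time t is defined as

$$X^{\mathrm{abs}}_{ijt} = |X_{it} - X_{jt}|.$$

Two related predictors are whether the behavior is more pronounced in the ego (the individual naming) than the alter (the individual named),

$$X^{\mathrm{sgn}}_{ijt} = \operatorname{sgn}(X_{it} - X_{jt}),$$

and the average value of the trait across the dyad,

$$X^{\mathrm{ave}}_{ijt} = (X_{it} + X_{jt})/2.$$

The vector containing the key predictors considered for continuous trait X is therefore $X_{ijt} = (X^{\mathrm{abs}}_{ijt},\, X^{\mathrm{sgn}}_{ijt},\, X^{\mathrm{abs}}_{ijt} \times X^{\mathrm{sgn}}_{ijt},\, X^{\mathrm{ave}}_{ijt})^T$.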
In what follows, we justify the use of these variables to summarize the values of a continuous trait across the ego and the alter. The directional (ego minus alter) difference in a trait, written here as X, may be expressed as X = sgn(X)|X|, where |X| denotes the absolute difference and sgn(X) equals 1 if X>0 and −1 otherwise. A nice feature of decomposing X into |X| and sgn(X) is that they represent the main effects of magnitude and directionality, and their product is the associated interaction. Scatterplots of the logit-transformed proportions of ties broken and new ties formed, respectively, versus the directional difference of BMI (ego − alter), grouped in small subintervals, are displayed in Figures 1 and 2. Over the interior region of the plot, the trend estimated using a lowess smoother is approximated by a V for tie dissolution and by an inverted V for tie formation, suggesting that the absolute difference is an appropriate dissimilarity metric for continuous traits [20]. The absolute value has been used by others as a metric of dissimilarity of continuously valued variables. For example, Zeng and Xie use the absolute difference to test for homophily in age, grade-point average, and socio-economic status [20]. We also considered using other distance metrics, such as the squared distance, but none offered general improvement over |X|.
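As an illustration of the decomposition into magnitude, direction, and their interaction, the following Python sketch computes the four dyad-level summaries of a continuous trait. The function name and dictionary keys are hypothetical labels introduced for this example.

```python
def continuous_dyad_predictors(x_ego: float, x_alter: float) -> dict:
    """Summarize a continuous trait across an ego-alter pair."""
    diff = x_ego - x_alter
    abs_diff = abs(diff)               # magnitude (dissimilarity)
    sign = 1 if diff > 0 else -1       # direction: 1 if ego exceeds alter, else -1
    return {
        "abs": abs_diff,
        "sgn": sign,
        "abs_x_sgn": abs_diff * sign,  # interaction; equals the signed difference
        "ave": (x_ego + x_alter) / 2,  # average level of the trait in the dyad
    }

heavier_ego = continuous_dyad_predictors(30.0, 25.0)
lighter_ego = continuous_dyad_predictors(25.0, 30.0)
```

Swapping ego and alter leaves the magnitude and average unchanged and flips only the sign and the interaction term.
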
Figure 1. Scatterplot of logit of tie-dissolution proportions versus BMI difference. Note: Each data point corresponds to a ‘bin’ containing 2 per cent of the observations where bin membership is determined from the quantiles of ego BMI − alter BMI. The proportion of tie-dissolution events transformed to the logit scale is plotted against the mean BMI difference for the 50 bins.
Figure 2. Scatterplot of logit of tie-formation proportions versus BMI difference. Note: Each data point corresponds to a ‘bin’ containing 2 per cent of the observations where bin membership is determined from the quantiles of ego BMI − alter BMI. The proportion of tie-formation events transformed to the logit scale is plotted against the mean BMI difference for the 50 bins.
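The binning described in the figure notes can be sketched in Python as follows. This is an illustration rather than the code used to produce the figures: the function name is hypothetical, equal-sized bins of the sorted differences stand in for quantile bins, and a bin in which all or none of the observations are events is given no logit because the transform is undefined there.

```python
import math

def binned_logits(diffs, events, n_bins=50):
    """Return (mean difference, logit of event proportion) for each bin."""
    pairs = sorted(zip(diffs, events))           # order observations by ego - alter difference
    n = len(pairs)
    points = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue                             # fewer observations than bins
        mean_diff = sum(d for d, _ in chunk) / len(chunk)
        p = sum(e for _, e in chunk) / len(chunk)
        logit = math.log(p / (1 - p)) if 0 < p < 1 else None
        points.append((mean_diff, logit))
    return points

# Toy data: four observations split into two bins.
toy_points = binned_logits([-2, -1, 1, 2], [1, 0, 0, 1], n_bins=2)
```
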
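For binary-valued traits (smoking, personality type, only child, and handedness), the key predictors are the indicator variables

$$X^{\mathrm{ego}}_{ijt} = X_{it}(1 - X_{jt}), \qquad X^{\mathrm{alter}}_{ijt} = (1 - X_{it})X_{jt}, \qquad X^{\mathrm{both}}_{ijt} = X_{it}X_{jt}.$$

These variables indicate whether only the ego, only the alter, or both the ego and alter exhibit the trait, respectively (the left-out category is that neither ego nor alter exhibits the trait). $X^{\mathrm{ego}}_{ijt}$ and $X^{\mathrm{alter}}_{ijt}$ are dissimilarity measures whereas $X^{\mathrm{both}}_{ijt}$ is a measure of prevalence. The vector of key predictors for binary trait X is, therefore, $X_{ijt} = (X^{\mathrm{ego}}_{ijt},\, X^{\mathrm{alter}}_{ijt},\, X^{\mathrm{both}}_{ijt})^T$.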
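A minimal Python sketch of the three indicators for a binary trait follows. It assumes that the ego and alter indicators mean that only the ego, or only the alter, exhibits the trait, which is what makes "neither" the single omitted category; the function name and keys are hypothetical.

```python
def binary_dyad_predictors(x_ego: int, x_alter: int) -> dict:
    """Indicators for a binary trait (1 = exhibits the trait) across an ego-alter pair."""
    return {
        "ego_only": x_ego * (1 - x_alter),
        "alter_only": (1 - x_ego) * x_alter,
        "both": x_ego * x_alter,
    }

# All four ego/alter combinations; (0, 0) is the reference category.
table = {(e, a): binary_dyad_predictors(e, a) for e in (0, 1) for a in (0, 1)}
```
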
Other characteristics are contained in the vector Zijt. We control for age (the absolute value of the difference in age and average age), gender (both male and both female indicators), geographic separation (the physical geodesic distance between persons' residential abodes or the absolute change in this over the current exam and the preceding exam), and education (absolute difference and average of an ordinal variable ranging from 0 (none) to 8 (post graduate)). We see age and gender (and education) as being relevant in and of themselves, but also as proxies for other traits, including enculturation regarding social behavior, wealth, and so on. Our main objective here was to include them as control variables for the other effects that are our explicit focus. We scaled all elements of Xijt and Zijt to have mean 0 and standard deviation 1, allowing effects to be directly compared between predictors.
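The scaling step can be sketched in Python as below. The text does not say whether the sample or the population standard deviation was used; the sample version is an assumption of this sketch, as is the function name.

```python
import statistics

def standardize(values):
    """Rescale a predictor to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (assumed)
    return [(v - mean) / sd for v in values]

z = standardize([1.0, 2.0, 3.0])
```

After scaling, a coefficient is the change in the log-odds associated with a one standard deviation change in the predictor, which is what makes coefficients comparable across predictors.
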
Although the patients are examined on different days and have different lengths of time between exams, the exact date when the status of a friendship nomination would have changed is unknown. Therefore, we treat the data as if everyone was examined at the same time at each exam, yielding a regular longitudinal dataset.
2.3. Missing data
For each member of the Offspring cohort, we have data from up to seven medical examinations. Only 24 individuals dropped out of the Offspring cohort (0.47 per cent). However, on occasion, individuals miss an exam or the data recorded from an exam are only partially complete. The latter occurs most frequently when a variable is purposely not included in an exam; in a small number of cases data are missing due to chance (e.g. due to a data recording error), and these data are most likely missing completely at random.
We treated height, birth order, and only child as time invariant as almost all study participants are adults. There were 10 cases where handedness apparently changed from left to right or vice versa; we treated these as if the patient was left-handed throughout making this trait ‘ever left-handed’.
Tie exams in which an individual misses an exam, dies, or otherwise terminates involvement with the FHS are excluded from the analysis. Thus, we implicitly assume that earlier observations on subjects who die during the study period adhere to the same underlying data generating process as observations on individuals who survive the entire study period.
2.4. Statistical analysis
The goal of the analysis is to determine the effect of actors' health traits on the status of close friend nominations in the FHS network. We propose a stochastic model in which the transition probabilities that the tie changes from connected to unconnected or vice versa depend on the status of the tie at the preceding exam and characteristics of the individuals.
The Markov property—namely that the status of a tie at the next exam depends only on the current status of the dyad (the pair of ties in each direction)—is inherent to the model. One of the reasons we expect the Markov property to hold is that the observation times are at least two years apart and so it is unlikely that the status of a dyad two or more exams in the past exerts much influence given the status at the preceding exam. Thus, conditional on Yijt−1 and Yjit−1, dyadic status at exam t−2 or earlier is considered uninformative.
Because participants generally name a single close friend contact at each exam, the observed network is sparse (as shown in Table I (A, B), 0.5 per cent or less named more than one contact). Owing to its sparseness, cross sections of the FHS friendship network are not amenable to analyzing higher-order effects such as transitivity [21]. Therefore, we only consider models that, conditional on individual random effects, exhibit dyadic independence at each exam. An advantage of such parsimony is that the network for a random sample of dyads has the same distribution as the whole network; thus the model has the attractive property of being generative. However, because the data are rich longitudinally due to almost non-existent study attrition, the design still allows valuable information about the effect of health behaviors on dyad status to be obtained and presents a unique opportunity to study the dynamic properties of the network at the level of the dyad. The innovative feature of our models is thus the longitudinal component, which is very unusual for network models.
Table I. Degree distribution by exam: (A) for the offspring cohort members and (B) for those offspring cohort members who were in a non-null dyad at some point.
| Number of named friends | Exam 1 | Exam 2 | Exam 3 | Exam 4 | Exam 5 | Exam 6 | Exam 7 |
In recognition of the limited cross-sectional capacity of the data, we fit separate models for the tie-dissolution and tie-formation probabilities. The binary regression models for the tie-dissolution and tie-formation probabilities are given by

$$g\{\Pr(Y_{ijt} = 0 \mid y_{ijt-1} = 1)\} = \lambda_d\, y_{jit-1} + \beta_d^T x_{ijt-1} + \gamma_d^T z_{ijt-1} + \theta_{d,i} + \eta_{d,j} \qquad (1)$$

$$g\{\Pr(Y_{ijt} = 1 \mid y_{ijt-1} = 0)\} = \lambda_f\, y_{jit-1} + \beta_f^T x_{ijt-1} + \gamma_f^T z_{ijt-1} + \theta_{f,i} + \eta_{f,j} \qquad (2)$$

where $\theta_{d,i} \sim N(0, \sigma_d^2)$, $\eta_{d,j} \sim N(0, \tau_d^2)$, $\theta_{f,i} \sim N(0, \sigma_f^2)$, and $\eta_{f,j} \sim N(0, \tau_f^2)$ are the sender (nominator) and receiver (nominated) random effects, respectively. We analyze the data using the logit link g(p) = log(p/(1−p)). The fixed-effect parameters λ = (λd, λf)T, β = (βd, βf)T, and γ = (γd, γf)T quantify the effects of reciprocity, the health behavior variables, and the other covariates on the probability of a given tie dissolving (equation (1)) or forming (equation (2)), respectively. The models given by (1) and (2) are Markov conditional on the latent variables (i.e. random effects) for actors' sender and receiver effects, and so the probability of a tie transition depends only on the current state of the tie and actor-specific latent variables. The models are thus hierarchical generalized linear models with a bivariate random effect and lagged outcomes to account for cross-sectional and longitudinal dependence. While dyadic independence is a strong assumption, it is made more believable by the fact that the assumption is conditional on observed and unobserved actor-level attributes and associated effects [21]. We compared the models presented in this paper to those with random effects for each dyad but found that they did not improve upon the bivariate actor-level random effect specifications in (1) and (2).
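A minimal Python sketch shows how a transition probability follows from a linear predictor under the logit link. It assumes the linear predictor is the sum of a reciprocity term, trait and covariate terms, and the sender and receiver random effects. The function and argument names are hypothetical, and the coefficients in the usage lines are made-up values, not estimates from this study.

```python
import math

def inv_logit(eta: float) -> float:
    # Inverse of the logit link g(p) = log(p / (1 - p)).
    return 1.0 / (1.0 + math.exp(-eta))

def transition_probability(lam, beta, gamma, y_ji_prev, x_prev, z_prev,
                           theta_i=0.0, eta_j=0.0):
    """Probability of a tie transition given lagged predictors and random effects."""
    linear = (lam * y_ji_prev                               # reciprocity
              + sum(b * x for b, x in zip(beta, x_prev))    # health-trait predictors
              + sum(g * z for g, z in zip(gamma, z_prev))   # other covariates
              + theta_i + eta_j)                            # sender and receiver effects
    return inv_logit(linear)

# Hypothetical coefficients chosen only to show the mechanics.
p_baseline = transition_probability(0.0, [0.0], [0.0], 0, [1.0], [1.0])
p_reciprocated = transition_probability(-1.0, [0.0], [0.0], 1, [1.0], [1.0])
```

With a negative reciprocity coefficient in a dissolution model, a reciprocated tie has a lower dissolution probability than the baseline.
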
Because θi = (θd,i, θf,i)T and ηj = (ηd,j, ηf,j)T are random, inferences about β, λ, and γ are based on variation both between and within ties such that the weight of the former decreases as τ² increases. We are interested in the effects of the predictors on individuals' close friend choices, as opposed to population average effects, so inferences focus on the parameters themselves rather than on marginal effects that average over observed (zijt) and unobserved (θ, η) variables.
The predictors are lagged to ensure that β is the effect of differences in health traits between individuals on their friendship status and not the reverse. Consequently, data from the first exam are only used to form predictors and do not contribute records used in model fitting. To estimate the effect of changes in health traits on tie dissolution and tie formation, a sensitivity analysis in which both xijt−1 and xijt are included in the model was performed.
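The lagging of predictors and the split into tie-dissolution and tie-formation records can be sketched in Python. The data layout (one list per dyad, indexed by exam, with `None` for a missed exam) and all names are assumptions made for this illustration.

```python
def transition_records(y_ij, y_ji, x):
    """Build lagged records for the dissolution and formation models of tie i -> j.

    y_ij, y_ji: tie indicators by exam (None = status unknown); x: a predictor by exam.
    """
    dissolution, formation = [], []
    for t in range(1, len(y_ij)):        # the first exam only supplies predictors
        prev, curr = y_ij[t - 1], y_ij[t]
        if prev is None or curr is None:
            continue                     # status unknown at one of the two exams
        record = {"exam": t + 1, "y_ji_prev": y_ji[t - 1], "x_prev": x[t - 1]}
        if prev == 1:                    # existing tie: it can only dissolve
            dissolution.append({**record, "event": int(curr == 0)})
        else:                            # absent tie: it can only form
            formation.append({**record, "event": int(curr == 1)})
    return dissolution, formation

diss, form = transition_records([1, 1, 0, 0, 1], [0, 1, 1, 0, 0],
                                [0.1, 0.2, 0.3, 0.4, 0.5])
```

Each record pairs the outcome at one exam with predictors from the preceding exam, so the outcome cannot have influenced its own predictors.
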
If not for the situation when a study participant nominates multiple persons (the interviewer has no prerogative to pick one over another and so records all nominations), the sender variance components (σd², σf²) would be pure measures of unobserved heterogeneity between exams in the propensity of an individual to dissolve or form a tie. The receiver variance components (τd², τf²) quantify unobserved heterogeneity between individuals within and across exams. We refrained from fitting a hierarchical extension of the p2 model [22, 23], a traditional model for dyadic data, for two reasons. First, because the FHS only requires that participants name a single close friend at each exam, (σd², σf²) are expected to be very small relative to (τd², τf²) and possibly close to 0, making the correlation between sender and receiver effects, a key parameter of the p2 model, difficult to estimate. Second, the separable tie-dissolution/tie-formation model in (1) and (2) is easier to fit (especially on a network the size of ours) [24, 25].
Initially, we fit separate models for each trait. Thus, in the first model xijt−1 contained the predictors summarizing the actors' BMI; a separate model evaluates the predictors of body proportion, muscularity, and so on through the remaining traits of interest (depression, blood pressure, height, smoking, personality type, birth order, only child, handedness). This allows the marginal effect of each health trait to be determined. We then tested whether controlling for other traits modified the effect(s) of the trait being analyzed. However, with the exception of BMI and smoking, and BMI and body proportion, the traits had very little impact on each other, implying that modeling a single trait at a time was sufficient. In particular, we were concerned that the effects of BMI and smoking may not be fully revealed if they are not inferred from a single model as it is well known that smoking can be a form of weight control [26]. Indeed, the joint inclusion of BMI and smoking increased the magnitude of the BMI tie-formation and the smoking tie-dissolution effects, the latter becoming significant at the 0.05 level (Section 3.2). The correlations between the estimated coefficients of the BMI and smoking variables are (surprisingly) small, ranging from −0.033 (BMI and Smoke) to 0.083 (BMI and Smoke). It makes sense that the largest and most positive correlation should occur between BMI and Smoke as these predictors are both measures of prevalence.
In a second sensitivity analysis, we substituted yjit−1 with three predictors: yjit−1yjit, (1−yjit−1)yjit, and yjit−1(1−yjit). These form a dynamic representation of reciprocity with specific effects for whether the incoming tie existed at the exams t−1 and t, whether the incoming tie was formed between exams t−1 and t, and whether the incoming tie dissolved between exams t−1 and t. The baseline level is non-existence of the tie at both exams. We performed this analysis to determine if the change in the status of the incoming tie had an effect on the change in status of the outgoing tie.
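The three dynamic reciprocity predictors can be written out in a short Python sketch; the function name and the keys are hypothetical labels for the products given in the text.

```python
def dynamic_reciprocity(y_ji_prev: int, y_ji_curr: int) -> dict:
    """Indicators describing the incoming tie j -> i between exams t-1 and t."""
    return {
        "persisted": y_ji_prev * y_ji_curr,        # existed at both exams
        "formed": (1 - y_ji_prev) * y_ji_curr,     # formed between the exams
        "dissolved": y_ji_prev * (1 - y_ji_curr),  # dissolved between the exams
    }

# (0, 0), the incoming tie absent at both exams, is the baseline level.
levels = {(p, c): dynamic_reciprocity(p, c) for p in (0, 1) for c in (0, 1)}
```
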
We used the lmer function in the lme4 package in R [27] to fit the cross-classified random effect logistic regression models in (1) and (2). R was also used for data manipulation, supporting calculations, and figure development. All p-values reported in this paper assume two-tailed tests.
2.5. Sampling non-connected ties
Across the seven exams, a total of 1286 dyads (2572 potential ties) involving 1876 unique actors ever had a status other than null (no ties). Therefore, without considering the FHS members who did not name an FHS participant or who were not named by an FHS participant in the Offspring cohort, there are approximately 1286 × 1285/2 − 1286 = 824 969 dyads that remained null across the seven exams.
In the tie-formation analysis, we condition on dyads with null outbound ties at an exam and model the probability that the outbound tie forms (i.e. the given receiver is named) by the next exam. Given that the unit of response is the dyad exam, there are approximately 21 million ties that could form if each of 1876 actors knew of and could name every actor at exams 2 through 7, whereas a mere 568 ties transition from null to connected at some point (Section 3.2). To make the tie-formation model less computationally burdensome to fit, we randomly sampled dyads whose ties were unconnected at every exam (i.e. for whom a tie never formed) and combined these with the 1286 dyads that ever had a non-null tie. Specifically, for a randomly selected actor in each of these 1286 non-null dyads, we randomly sampled k actors for whom the resulting dyad was null across the seven exams, yielding an additional 1286k dyads (2572k tie-level observations). Such dyadic partners were restricted to Offspring members who at some point nominated another FHS study participant, to maximize the chance that the individuals actually knew, and could have nominated, each other. Dyads of individuals known to be relatives were excluded prior to sampling.
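The sampling of always-null dyads can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually run: the function and argument names are hypothetical, dyads are stored as unordered pairs, and no attempt is made to prevent the same null dyad from being drawn twice.

```python
import random

def sample_null_dyads(non_null_dyads, eligible, ever_non_null, relatives, k, seed=0):
    """For one randomly chosen actor per non-null dyad, draw k always-null partners.

    non_null_dyads: list of (i, j) pairs; eligible: list of candidate partner ids;
    ever_non_null, relatives: sets of frozenset pairs to exclude.
    """
    rng = random.Random(seed)
    sampled = []
    for i, j in non_null_dyads:
        anchor = rng.choice((i, j))              # randomly selected actor of the dyad
        candidates = [a for a in eligible
                      if a != anchor
                      and frozenset((anchor, a)) not in ever_non_null
                      and frozenset((anchor, a)) not in relatives]
        draw = rng.sample(candidates, min(k, len(candidates)))
        sampled.extend((anchor, a) for a in draw)
    return sampled

# Toy example: actors 1 and 2 form a non-null dyad; actor 3 is a relative of both.
controls = sample_null_dyads(
    non_null_dyads=[(1, 2)],
    eligible=[1, 2, 3, 4, 5],
    ever_non_null={frozenset((1, 2))},
    relatives={frozenset((1, 3)), frozenset((2, 3))},
    k=2,
)
```
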
The ensuing analysis seeks to find the regression parameters that best discriminate between dyad exams where new outbound ties form and those where no tie forms. The model in (2) is still appropriate for use although, as in case–control analyses, the intercept parameter does not reflect the overall proportion of null to non-null dyadic transitions.
The total number of dyads in the dataset for the tie-formation analysis is 1286(k + 1), where k is the ratio of ‘controls’ to ‘cases’. We used k = 5 to compute the results reported here; larger values of k had little impact on the results (in general offering only slightly more precision). The sample size for the tie-formation analysis is the number of dyad exams at which the outbound tie could have transitioned from null to connected across the seven exams. The 6430 always-null dyads add a further 38 580 (77 160) potential dyadic (tie)-level transitions across the seven waves to the tie-formation analyses. However, as noted in Section 3.2, some of these observations are not used due to missing dyadic status at the preceding exam.
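The counts quoted in this section can be checked with a few lines of Python. The variable names are hypothetical, and reading the figure of approximately 21 million as ordered pairs of actors multiplied by the six between-exam transitions is an interpretation made for this sketch.

```python
n_non_null_dyads = 1286
n_actors = 1876
k = 5                     # ratio of sampled 'controls' to 'cases'
n_transitions = 6         # exams 2 through 7

# Approximate number of dyads that stayed null, using the formula given in the text.
null_dyads_approx = n_non_null_dyads * (n_non_null_dyads - 1) // 2 - n_non_null_dyads

# Directed ties that could form if every actor could name every other actor (assumed reading).
possible_tie_exams = n_actors * (n_actors - 1) * n_transitions

sampled_null_dyads = n_non_null_dyads * k             # always-null dyads added
total_dyads = n_non_null_dyads * (k + 1)              # dyads in the formation dataset
dyad_transitions = sampled_null_dyads * n_transitions
tie_transitions = 2 * dyad_transitions                # two directed ties per dyad
```
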