#### Discussion on the paper by Handcock, Raftery and Tantrum

**Tom A. B. Snijders** (*University of Oxford and University of Groningen*)

The paper by Handcock, Raftery and Tantrum is an interesting new step in modelling social networks—more specifically, digraphs (directed graphs). This is a basic data structure for representing relational data, which is increasingly important throughout the social sciences.

To make a plausible stochastic model for one observation of a digraph, the crucial issue is to represent the stochastic dependence between the tie variables *Y*_{i,j}. Important types of dependence are dyadic dependence between reciprocal tie variables *Y*_{i,j} and *Y*_{j,i}, and triadic dependence between tie variables involving three nodes, such as the pair (*Y*_{i,j}, *Y*_{i,k}), or the triple (*Y*_{i,j}, *Y*_{j,k}, *Y*_{i,k}), which is used to represent tendencies towards transitivity.

Two general ways for representing dependence between tie variables have been presented in the literature. One is by postulating latent nodal variables and conditional independence of the observations, given the latent variables, in the classical Lazarsfeld tradition of latent structure models. This way is followed in the present paper. A discrete latent class approach was proposed in Nowicki and Snijders (2001). The second way is by directly modelling this dependence, as is done in exponential random-graph models. The landmark paper here is Frank and Strauss (1986); this type of modelling has become practically feasible especially since the model extensions that were recently presented in Snijders *et al.* (2006). Further research is needed to compare these ways of representing dependence in stochastic digraphs theoretically and empirically; one very practical advantage of the latent structure models is that they allow us to handle randomly missing data in almost trivial ways—which is quite unusual for techniques of social network analysis.

When using a latent distance model, a major question is the type of metric to employ. The current paper proposes a Euclidean metric with a superimposed clustering. An alternative is an ultrametric (Schweinberger and Snijders, 2003), which is equivalent to a system of nested groups, or clusters, without further structure. When employing the Euclidean metric the modeller must choose the number of dimensions; for the ultrametric model, the number of nesting levels. The Euclidean metric is richer, and its spatial arrangement has effects on the probability of ties both within and between clusters. Filling in this richer detail also requires more from the data. Which metric is more appropriate is an empirical matter and also depends on the substantive background knowledge and the substantive questions being asked. From the empirical side, a more detailed assessment of fit than the Bayes information criterion approximations that are proposed in the paper as a global measure would be interesting. It seems a good idea to follow the paper and to assess fit conditionally on estimated positions (although this is not appropriate for comparing the fit with other types of model). Detailed fit assessments could be based on the contributions of dyads to the log-likelihood,

*l*_{i,j}(*Z*) = *y*_{i,j} log *p*_{i,j}(*Z*) + (1 − *y*_{i,j}) log{1 − *p*_{i,j}(*Z*)}
(ignoring in the notation for *l* the dependence on *X* and *β*). The fit of node *i* can be assessed on the basis of the sum Σ_{j} {*l*_{i,j}(*Z*) + *l*_{j,i}(*Z*)}. Comparing this across nodes will indicate which nodes conform relatively poorly to the Euclidean model. Similarly, when *C*_{g} for *g* = 1, …, *G* are the sets of nodes defining the *G* clusters after post-processing the data, the quality of the representation of the within-cluster and between-cluster ties can be measured by using

Σ_{g} Σ_{*i*, *j* ∈ *C*_{g}} *l*_{i,j}(*Z*)

and

Σ_{g,h} Σ_{*i* ∈ *C*_{g}, *j* ∈ *C*_{h}} *l*_{i,j}(*Z*)

respectively (where *g* ≠ *h*).

The sampling properties of such fit statistics will be quite complicated and a parametric bootstrap may be too time consuming to approximate them. However, given that we are discussing a fit assessment conditionally on the estimated latent positions, a natural first-order standardization is to treat the *Y*_{i,j} as independent binary variables for the given *X*, *Z* and *β*, and to standardize by using the accordingly calculated means and variances.
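As a sketch of this first-order standardization, assume for illustration the simplest latent distance model with no covariates, logit P(*Y*_{i,j} = 1) = *β*_{0} − |*z*_{i} − *z*_{j}| (the intercept-only form is a simplifying assumption, not the paper's full specification); the dyadic contributions and standardized node fits can then be computed as follows.

```python
import numpy as np

def tie_probs(Z, beta0=0.0):
    # Simplified latent distance model (illustrative assumption):
    # logit P(Y_ij = 1) = beta0 - |z_i - z_j|.
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    return 1.0 / (1.0 + np.exp(-(beta0 - D)))

def dyad_loglik(Y, Z, beta0=0.0):
    # Dyadic contributions l_ij = y_ij log p_ij + (1 - y_ij) log(1 - p_ij),
    # with the (meaningless) diagonal zeroed out.
    P = tie_probs(Z, beta0)
    L = Y * np.log(P) + (1.0 - Y) * np.log(1.0 - P)
    np.fill_diagonal(L, 0.0)
    return L

def node_fit(Y, Z, beta0=0.0):
    # Fit of node i: sum_j (l_ij + l_ji), standardized to first order by
    # treating the Y_ij as independent Bernoulli(p_ij) given the
    # estimated positions, as suggested in the text.
    L = dyad_loglik(Y, Z, beta0)
    P = tie_probs(Z, beta0)
    m = P * np.log(P) + (1.0 - P) * np.log(1.0 - P)   # E(l_ij)
    v = P * (1.0 - P) * np.log(P / (1.0 - P)) ** 2    # var(l_ij)
    np.fill_diagonal(m, 0.0)
    np.fill_diagonal(v, 0.0)
    obs = L.sum(axis=1) + L.sum(axis=0)
    mean = m.sum(axis=1) + m.sum(axis=0)
    var = v.sum(axis=1) + v.sum(axis=0)
    return (obs - mean) / np.sqrt(var)
```

Large negative standardized values then flag nodes that conform poorly to the fitted Euclidean representation.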

Next to a detailed fit analysis, a detailed sensitivity analysis will be interesting, to assess how well the data determine the Euclidean positions. Denote by *Z*[*i* + *u*] the array *Z* in which the position of node *i* has been translated by adding to it the vector *u*. Then Σ_{j} *l*_{i,j}(*Z*[*i* + *u*]) indicates the sensitivity of the conditional log-likelihood to translation of node *i* by the vector *u*. This will have a local maximum at *u* = 0 and preferably is approximately concave as a function of *u*. If there are regions away from *u* = 0 with local maxima that are not much lower than the value at *u* = 0, then node *i* has an ambiguous position. Similarly, the change in log-likelihood can be calculated that results from translating all nodes in a whole cluster *C*_{g} by the same vector *u*, or from orthogonally rotating all points in a cluster. This will yield possibilities for diagnosing how well the between-cluster patterns of ties determine the relative positions of the clusters.
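A minimal numerical sketch of this translation diagnostic, under a simplified intercept-only latent distance model, logit P(*Y*_{i,j} = 1) = *β*_{0} − |*z*_{i} − *z*_{j}| (an illustrative assumption):

```python
import numpy as np

def conditional_loglik(Y, Z, beta0=0.0):
    # Conditional log-likelihood of the digraph given positions Z,
    # under the simplified model logit P(Y_ij = 1) = beta0 - |z_i - z_j|.
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    P = 1.0 / (1.0 + np.exp(-(beta0 - D)))
    L = Y * np.log(P) + (1.0 - Y) * np.log(1.0 - P)
    np.fill_diagonal(L, 0.0)
    return L.sum()

def translation_profile(Y, Z, i, shifts, beta0=0.0):
    # Evaluate the log-likelihood at Z[i + u] for each candidate shift u:
    # node i is translated by u while all other positions stay fixed.
    # A well determined position shows a clear maximum at u = 0.
    out = []
    for u in shifts:
        Zu = Z.copy()
        Zu[i] = Zu[i] + np.asarray(u)
        out.append(conditional_loglik(Y, Zu, beta0))
    return np.array(out)
```

Evaluating the profile over a grid of shifts *u* (and analogously over common translations or rotations of whole clusters) exposes ambiguous positions as secondary local maxima of comparable height.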

The clustered spatial representation that is proposed in this paper represents what sociologists call the cohesive structure of the network. However, digraphs can have many structural features, and the cohesive structure is only one. While remaining in the framework of continuous latent variable models, it is straightforward also to represent the structural properties of hierarchy and of prominence. Hierarchy means that there is an order between the actors, and ties have a preferential direction. This can be important, e.g. when the relationship that is under study is an advice relationship, where the hierarchy could reflect expertise. Status differences also can give rise to hierarchically structured networks. Denoting now by *z*_{1i} instead of *z*_{i} the (multidimensional) locations representing propinquity, hierarchy can be represented by using an additional vector of one-dimensional latent variables *z*_{2i}, and adding

*z*_{2j} − *z*_{2i}

to the log-odds of the tie from *i* to *j*. This can be complemented by a third vector of latent variables, again one dimensional, contributing

*z*_{3i} + *z*_{3j}

to the log-odds of a tie; this represents prominence, defined as the propensity to have ties. For (*Z*_{2i},*Z*_{3i}), we could postulate mixtures of (correlated!) bivariate normal distributions. This is nothing other than a reparameterization of the random activity and popularity effects of van Duijn *et al.* (2004). The parameterization that is proposed here has the advantage that it directly expresses the hierarchical aspect of the network structure, which is substantively interesting in many applications. When employing latent Euclidean distance models to represent directed social networks, it seems to me that the default should also be to include hierarchy and prominence (or activity and popularity) dimensions in the latent variables.
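A minimal sketch of such an extended model, assuming the hierarchy contribution takes the antisymmetric form *z*_{2j} − *z*_{2i} and the prominence contribution the symmetric form *z*_{3i} + *z*_{3j} (with an intercept *β*_{0} standing in for the covariate terms of the full model):

```python
import numpy as np

def extended_log_odds(Z1, z2, z3, beta0=0.0):
    # Illustrative assumption for the extended log-odds of the tie i -> j:
    #   beta0 - |z1_i - z1_j|   (propinquity, symmetric)
    #   + (z2_j - z2_i)         (hierarchy: ties run "uphill")
    #   + (z3_i + z3_j)         (prominence: overall propensity for ties)
    D = np.linalg.norm(Z1[:, None, :] - Z1[None, :, :], axis=-1)
    H = z2[None, :] - z2[:, None]
    Pm = z3[:, None] + z3[None, :]
    return beta0 - D + H + Pm
```

Because the distance and prominence terms are symmetric, all asymmetry between the log-odds of *i* → *j* and *j* → *i* is carried by the hierarchy term, which is what makes the parameterization directly interpretable.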

Even after such an extension, the latent space models and the exponential random-graph models are currently the main ‘competitors’ for statistically modelling non-longitudinal observations of social networks. Further practical experience with these models is necessary to assess their worth; this will need to involve more detailed studies of fit and sensitivity than we have seen so far. I expect that, especially for modelling larger networks (with, say, a few hundred or more nodes), the latent space models will not be able to represent network structures as expressed by subgraph counts sufficiently well and the exponential random-graph models will not be able to represent the cohesive structure sufficiently well. Models that combine important features of these two approaches may be the next generation of social network models.

I am very pleased to propose the vote of thanks for this very interesting paper.

**Tony Robinson** (*University of Bath*)

A compelling advantage when using mixture model-based clustering in a Bayesian framework is the ability to obtain posterior probability information on *all* quantities of interest. This paper neatly incorporates the technology in a novel approach for social network discovery. However, the benefits of mixture modelling come with a cost, especially in a clustering application for, by definition, objects are ‘mixed up’ and the results of analyses must be examined carefully to glean the important information about likely forms of object partitioning. This is certainly so when clustering on observables and now, challengingly, we see the authors applying the technique to positions of actors in a social space which is latent. Partitioning these actors is clearly a major objective in the current exercise. Most clustering methods partition reasonably well when the degree of separation between groups is marked, and correspondingly less well as the degree of overlap increases; model-based clustering is no exception. In using model-based clustering we often find that results for marginal quantities conflict with, or obscure, one another. For example there may be a tension between the likely number of components and the sampled frequency of likely partitions. A careful examination of the marginal, joint and conditional behaviour of the results is necessary for sensible inferences to be drawn.

Moreover the results of model-based clustering can be sensitive to structural assumptions which directly affect sampled partitioning. One such structural assumption here is that of spherical Gaussian components. The authors give some justification for such a choice based on invariance of the likelihood under the co-ordinate system and sphericity will certainly conveniently cut down on the number of parameters. But I do not find this wholly convincing and wonder whether the authors are truly averse to alternatives such as Gaussian components with a more flexible covariance structure or even non-Gaussian components. Are there underlying substantive considerations concerned with the nature of the social space and the clustering behaviour of actors within it or is social space so flexible as to render the distributional model choice essentially immaterial? I doubt that the latter is always so and question whether such a restrictive model can sometimes lead to too many clusters or overdispersed estimates which would affect the partitioning.

The authors take a fairly traditional approach to deciding on the number of clusters, apart from conditioning on actor positions. There are other approaches, such as transdimensional samplers, but they require even more sophistication in implementation. If determination of the number of clusters is made conditional on a posterior estimate of actor positions, care must be taken to ensure that these positions are determined fairly: they should not be influenced by an overly restrictive model specification, and the sampler should be designed to allow free mixing of actor positions across the latent space, to avoid imposing artificial clustering. Deciding on the number of clusters needs to take account of all aspects of the model.

This determination of configurations of actors in the social space has clear parallels with multidimensional scaling, in that both techniques aim to produce an interpretable configuration in a space of specified dimension from which structure can be identified. The default choice for the dimension of the latent space seems to be 2, as it is in most applications of multidimensional scaling, undoubtedly driven by ease of visualization in both cases. It would be standard good practice in multidimensional scaling to explore solutions in other nearby dimensions, and I would have liked to see the same in the two examples of latent space clustering. The authors do reference Oh and Raftery (2001) as a possible way to choose but otherwise leave the choice of dimension to be specified by the user, who will no doubt also choose 2 as a starting point and possibly a finishing point. For example, in the adolescent health example, would the choice of three or more dimensions lead to separation of the higher grades? Similarly, would a choice of one dimension yield essentially the same results as two, given that inspection of Fig. 8 seems to indicate groups with roughly increasing grades with anticlockwise movement around the configuration and the higher grades curling back towards the lower?

I also have a worry about the basic underlying model as specified in equation (2). If we accept that there may be underlying clusters, why should the covariates not act differently among them? The global behaviour in equation (2) clearly does not allow this possibility.

I believe that the authors have made a decent start at designing a potentially useful technique for clustering in static social networks but that users need to be aware that the technique is far from problem free and that they must be careful not to overcook the recipe and thereby to overinterpret results. It is my pleasure to second the vote of thanks.

The vote of thanks was passed by acclamation.

**Anthony C. Atkinson** (*London School of Economics and Political Science*) **and Marco Riani** (*Università di Parma*)

Over the years we have enjoyed both Adrian Raftery's talks and his flow of publications on model-based clustering. We would like to compare some results of the use of mclust with a cluster analysis that is produced by the use of the forward search.

The forward search for multivariate data is described in Atkinson *et al.* (2004). In general the search proceeds by successively fitting subsets of the data of increasing size. For a single multivariate population any outliers will enter at the end of the search with large Mahalanobis distances. If the data are clustered and the search starts in one of the, unfortunately unknown, clusters, the end of the cluster is indicated when the next observation to be added is remote from that cluster. To find clusters we have recently (Atkinson *et al.*, 2006a, b; Cerioli *et al.*, 2006) suggested running many searches from randomly selected starting-points. Some of these start in, or are attracted to, a single cluster; a forward plot of the minimum Mahalanobis distance of the observations that are not in the fitted subset then reveals the cluster structure.
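The flavour of the procedure can be conveyed by a heavily simplified sketch (an illustration of the idea only, not the Atkinson *et al.* (2004) implementation, which uses robustly chosen starting subsets and further refinements):

```python
import numpy as np

def forward_search(X, start):
    # Simplified forward search: fit the mean and covariance to the
    # current subset, record the minimum Mahalanobis distance among the
    # units outside the subset, then add the closest unit.  If the search
    # starts inside one cluster, units from another cluster enter late,
    # producing a large jump in the recorded distances.
    n, p = X.shape
    subset = list(start)
    min_dists = []
    while len(subset) < n:
        S = X[subset]
        mu = S.mean(axis=0)
        cov = np.cov(S, rowvar=False) + 1e-6 * np.eye(p)  # regularized
        inv = np.linalg.inv(cov)
        diff = X - mu
        d2 = np.einsum("ij,jk,ik->i", diff, inv, diff)    # squared distances
        outside = [i for i in range(n) if i not in subset]
        nxt = min(outside, key=lambda i: d2[i])
        min_dists.append(float(np.sqrt(d2[nxt])))
        subset.append(nxt)
    return np.array(min_dists), subset
```

A forward plot of `min_dists` then reveals cluster structure: the peak marks the step at which the first unit from outside the starting cluster is about to enter.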

As an example we analyse 272 observations on the eruptions of the Old Faithful Geyser taken from the Modern Applied Statistics in S library (Venables and Ripley, 2002). Azzalini and Bowman (1990) described the scientific problem. Fig. 9(a), a forward plot of Mahalanobis distances from the forward search, clearly shows the two groups. Fig. 9(b) is the Bayes information criterion output of mclust from S-PLUS which, on the contrary, indicates three clusters. Fig. 2 of Fraley and Raftery (2006) for a slightly different set of geyser data is similar and again indicates three clusters.

We use further forward searches to establish membership of these two clusters and establish the unclustered units. A scatterplot of the resulting two clusters is shown in Fig. 10(a). The three clusters that are found by mclust are in Fig. 10(b). Fuller details of our analysis including further comparisons and considerations of robustness are in Atkinson and Riani (2007).

We do not want to imply that the imposing edifice, some of whose rooms we have so enjoyably visited today, is built on sand. But it does seem that there are still some fundamental problems in the foundations of clustering that need to be resolved.

**Isobel Claire Gormley** (*University College Dublin*) **and Thomas Brendan Murphy** (*Trinity College Dublin*)

We congratulate the authors on a thought-provoking paper. We feel that the combination of model-based clustering with latent space modelling is applicable far beyond the proposed application of the analysis of social network data.

In our work, we found that model choice is a difficult aspect of the modelling process. The number of components in a mixture model can be estimated consistently by using the Bayes information criterion (Keribin, 2000) but we found that the choice of dimensionality in our latent space model is more problematic. We were wondering whether the authors could provide us with insight into the methods for choosing the dimensionality of the latent space in their social network model.

More recently, we have considered methods for including covariates in our models. One approach that we have considered is allowing the mixture probabilities to depend on covariates (Gormley, 2006); this yields a special case of the mixture-of-experts model (Jacobs *et al.*, 1991). This model can be fitted very easily with minor changes to the mixture modelling framework. In the context of this paper this may provide an alternative method for achieving homophily by attributes.
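For concreteness, the covariate-dependent mixing probabilities of such a mixture-of-experts model can be sketched with a multinomial logit gate (the weight matrix `W` here is purely illustrative):

```python
import numpy as np

def gating_probs(W, x):
    # Mixture-of-experts gating: component probabilities depend on the
    # covariate vector x through a multinomial logit,
    # lambda_g(x) proportional to exp(w_g' x).
    logits = W @ x
    logits = logits - logits.max()  # stabilize the exponentials
    e = np.exp(logits)
    return e / e.sum()
```

Replacing the fixed mixture weights *λ* by `gating_probs(W, x_i)` for each actor is the "minor change to the mixture modelling framework" alluded to, and gives one route to homophily by attributes.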

**Trevor Sweeting** (*University College London*)

I would be interested to hear from the authors whether they have considered using an infinite group cluster model and, if so, what they would consider to be the relative advantages and disadvantages of such a formulation over their finite group cluster model in the context of network models. There are various possible Bayesian formulations of infinite group cluster models. A common choice of prior distribution for the group weights *λ* arises from a Dirichlet process prior structure for the parameters of the latent positions, since this structure automatically induces clustering. Specifically, the Dirichlet process mixture (DPM) structure for the latent positions would be specified as

*z*_{i} | *θ*_{i} ∼ MVN_{d}(*μ*_{i}, *σ*²_{i} *I*_{d}),  *θ*_{i} = (*μ*_{i}, *σ*²_{i}) | *F* ∼ *F* independently,  *F* ∼ DP(*F*_{0}, *γ*).

Here DP(·, ·) denotes a (*d*+1)-dimensional Dirichlet process and *F*_{0} and *γ* are the associated mean and precision parameters. Now write *θ*_{g} = (*μ*_{g}, *σ*²_{g}) for the distinct atoms of *F* and *θ* = (*θ*_{1}, *θ*_{2}, …). Using Sethuraman's (1994) stick breaking representation of the Dirichlet process, the above specification is equivalent to the following infinite group version of the authors’ model:

*z*_{i} ∼ Σ_{g=1}^{∞} *λ*_{g} MVN_{d}(*μ*_{g}, *σ*²_{g} *I*_{d}),  (*μ*_{g}, *σ*²_{g}) ∼ *F*_{0} independently, with *λ*_{g} = *V*_{g} Π_{h<g} (1 − *V*_{h}) and *V*_{g} ∼ Beta(1, *γ*) independently.
For additional flexibility a prior distribution is often assigned to the precision parameter *γ*, chosen to reflect the prior expectation and uncertainty about the number of clusters that are contained in the data. It would be of interest to explore whether Markov chain Monte Carlo (MCMC) schemes in the literature in the case where the *z*_{i} are not latent (see, for example, Neal (2000)) could be readily integrated with the MCMC scheme that is given in Section 3.2 of the paper.
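Sethuraman's stick breaking construction is easy to simulate; a sketch, truncating once the remaining stick mass is negligible:

```python
import numpy as np

def stick_breaking_weights(gamma, eps=1e-8, rng=None):
    # Stick breaking: V_g ~ Beta(1, gamma) and
    # lambda_g = V_g * prod_{h<g} (1 - V_h).  The infinite sequence is
    # truncated when the remaining stick mass falls below eps, with the
    # remainder lumped into a final atom so the weights sum to 1.
    rng = rng or np.random.default_rng(0)
    weights, rest = [], 1.0
    while rest > eps:
        v = rng.beta(1.0, gamma)
        weights.append(rest * v)
        rest *= 1.0 - v
    weights.append(rest)
    return np.array(weights)
```

Smaller *γ* concentrates the mass on few atoms (few clusters); larger *γ* spreads it over many, which is why a prior on *γ* expresses prior beliefs about the number of clusters.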

The DPM model is just one possible model for Bayesian clustering. The generalization of the DPM model to product partition models, for example, is described in Quintana and Iglesias (2003). Potential advantages of an infinite group over a finite group cluster model would be that, firstly, neither the Bayes information criterion nor reversible jump MCMC methods would be necessary for estimation of the number of clusters, *G*, contained in the data; secondly, uncertainty about *G* could be readily assessed; thirdly, an infinite group model would deal more cleanly with the situation that is discussed in Section 6 where network data are available for only part of a population so that other clusters may not yet have been represented.

**David S. Leslie** (*University of Bristol*)

I congratulate the authors for their interesting paper. However, it seems that the Markov chain Monte Carlo sampling scheme that was used results in extremely slow mixing, requiring 2 million iterations with only every 1000th iteration being used. One aspect of this slow mixing relates to a problem that was encountered by Leslie *et al.* (2006).

The problem arises when we move from a simple latent structure to a mixture model latent structure. In Leslie *et al.* (2006) the transition was from simple probit regression with a normal latent structure to a binary choice regression model with latent variables drawn from a Dirichlet process mixture model. In the current paper the transition is from the simple normal model for the latent variables that was used by Hoff *et al.* (2002) to a situation in which the latent variables are drawn from a mixture of multivariate normal distributions. The natural sampling scheme to use when such an extension is made is that presented in the paper, where the component labels *K* are sampled conditionally on the latent variables *Z*; then the latent variables are sampled conditionally on the component labels *K*. However, a simple example suffices to see that this is likely to result in extremely poor mixing.

Consider the two-dimensional latent variables that are shown in Fig. 11, and consider first the process of updating the latent variable of the point that is marked; we shall call it point *i*. The latent variable *z*_{i} is drawn conditionally on membership of cluster 1 and so is highly likely to be close to the other members of cluster 1, and hence far from the members of cluster 2. Now consider updating the cluster label *K*_{i} conditionally on the latent variable value *z*_{i}: it is highly unlikely that *i* will be allocated to cluster 2 owing to the location of the latent variable *z*_{i}. As seen by this example, it is very difficult for the latent variables to move between clusters, owing to the high correlation between cluster labels *K*_{i} and latent values *z*_{i}.
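The severity of the problem is easy to quantify: holding the latent position fixed, the Gibbs update for *K*_{i} allocates to cluster *g* with probability proportional to *λ*_{g} φ(*z*_{i}; *μ*_{g}, *σ*² *I*). A sketch with illustrative numbers shows how astronomically small the probability of switching to a distant cluster is:

```python
import numpy as np

def allocation_prob(z, mus, sigma2, lam):
    # P(K_i = g | z_i): conditional allocation probabilities for the
    # Gibbs step, assuming spherical normal components with common
    # variance sigma2 and mixture weights lam (illustrative setup).
    logp = np.log(lam) - ((z - mus) ** 2).sum(axis=1) / (2.0 * sigma2)
    logp = logp - logp.max()  # stabilize before exponentiating
    p = np.exp(logp)
    return p / p.sum()
```

With unit variance and cluster means a distance 10√2 apart, a point sitting on the first mean has probability of order exp(−100) of being reallocated to the second cluster, so the chain effectively never moves between the two configurations.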

**N. T. Longford** (*SNTL, Leicester, and Universitat Pompeu Fabra, Barcelona*)

Mixtures of multivariate normal distributions, which are used by the authors with great skill, are a greatly underrated device for generating a wide variety of distributions. In more than two dimensions, the normal is the only comprehensive class of distributions that is easy to handle in the standard likelihood-related calculations. By fitting a multivariate normal mixture we approximate the target distribution. We are fitting not only the modes of the distribution but also its shoulders and tails. Therefore mixture components cannot be automatically associated with clusters. All clusters are ‘condemned to be normal’. For example, none of the clusters in the authors’ examples could be associated with a skewed latent distribution. Admittedly, the layer of latentness grants some flexibility in this respect, but a mixture component can be declared a cluster only when its variances are small relative to the distance of its expectation from the expectations of the other components. In a different context, Longford and Pittau (2006) present an analysis in which multivariate mixtures cannot be regarded as clusters because the mixture components differ principally by their patterns of variation and dependence.

The term ‘determination’ (of the number of components) sits very uncomfortably in the Bayesian terminology, because it implies elimination of any uncertainty. Instead of concluding that there are three components in the first example, I would prefer a brief discussion of the solutions for two and four components, accompanied by a comment on how much less likely they are and how their solutions are related to the preferred three-component solution.

I understand that inferences in both examples are made for the subjects in the respective studies, not for the underlying population. In a frequentist view, there is a puzzling ambiguity. What distribution would govern the responses of the 18 monks in a replication of the study: the posited model? A more realistic alternative is that some (strong) links would be declared in every replication, whereas some other (weaker) links may be declared with a range of probabilities. If all the links are strong there is no variation in the replicate response patterns and, presumably, there is no uncertainty in the inference. Would the process of forming links be also replicated? I would be concerned if these issues were regarded in the Bayesian paradigm as not relevant, even though I appreciate that some would be difficult to incorporate in the analysis.

**John T. Kent** (*University of Leeds*)

The motivation for the statistical models in this paper is mainly focused on the networking and social sciences perspective. However, it is also helpful to draw out the analogies to more conventional statistical methodology. For example, in regression analysis, we can

- (a)
start with a geometric relationship *y*=*a*+*bx*,

- (b)
include normal errors to obtain the usual linear model,

- (c)
extend this framework to a generalized linear model, with for example Bernoulli observations,

- (d)
include random effects to allow for grouping and

- (e)
view clustering as an unlabelled random-effects model.

Similarly, for data on the relationships between *n* individuals or sites, we can

- (a)
start with the mathematical result that knowing all the Euclidean distances between the *n* sites determines the configuration of sites (up to translation and orientation),

- (b)
introduce stochastic errors to obtain the classic multidimensional scaling estimation problem,

- (c)
extend this framework to a generalized linear model for the presence or absence of edges and

- (d)
introduce random effects and

- (e)
introduce clustering as before.

Since an underlying latent configuration is determined only up to translation and orientation (and often size), it enters the statistical model only through its shape. For example in equation (7), when updating *μ*_{g}, it is only the shape (or shape plus size) of the configuration which is identifiable, not the whole configuration, and the updating exercise should take this restriction into account. I suspect that the incorporation of shape ideas into the analysis would have only a minor practical effect, but it would be the ‘right’ approach in terms of identifiability.

A recent development in shape analysis is the investigation of unlabelled shapes, where we may want to match two configurations together, but not know which sites correspond. One application is to protein structure analysis, where the positions of atoms on each protein can be determined by X-ray crystallography and where it is suspected that (subregions of) proteins of similar shape have similar biological function, but where the labellings are unknown. There are some similarities to the problem of comparing clusterings determined by the {*μ*_{g}} in different Markov chain Monte Carlo simulations.
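For labelled configurations, the alignment over translation and orientation that underlies such comparisons can be sketched with an orthogonal Procrustes step:

```python
import numpy as np

def procrustes_align(Z, Zref):
    # Align configuration Z to Zref over translation and rotation
    # (orthogonal Procrustes): since a latent configuration enters the
    # model only through its shape, comparisons should be made after
    # such an alignment.
    A = Z - Z.mean(axis=0)         # center both configurations
    B = Zref - Zref.mean(axis=0)
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt                     # optimal orthogonal rotation
    return A @ R + Zref.mean(axis=0)
```

Aligning the posterior draws of the {*μ*_{g}} in this way before comparison is one concrete way to respect the identifiability restriction noted above.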

**Tony Lawrance** (*University of Warwick, Coventry*)

It is a pleasure to contribute to the discussion after the experts have spoken. My experience of this area and paper is the 2-hour train journey from Warwick to London, a distance of nearly zero according to the metric which this paper induced. My first point is to enquire how the analysis can address the measurement of friendliness or interactions of the actors that are involved, rather than their groupings. Secondly, I noticed that the modelling is predicated on a conditional independence assumption, which it must be tough to validate and is probably a matter of faith. I could not immediately see any attention in the paper given to assessing the fit of the model, and the choice of prior forms seemed clever, but I wonder how much can they influence the final groupings? Wider empirical validation, replacing monks and monasteries by lecturers and departments, would satisfy me more generally. My final observation concerns the microscopic pie charts, noting that they take the development of invisible graphics to a new level, at least judging by my monochrome preprint. Overall, I thought that this paper was a nice blend of methodology and application.

The following contributions were received in writing after the meeting.

**Edoardo M. Airoldi** (*Carnegie Mellon University, Pittsburgh*)

The authors’ work with the *latent space clustering* methodology provides an impressive demonstration of the use of hierarchical models for identifying groups of nodes from observed connectivity patterns. Modelling choices based on sociological principles, i.e. transitivity and homophily, increase its appeal as an exploratory tool for the analysis of social networks. The methodology proposed goes only part way, however, towards addressing fundamental issues that arise in the statistical analysis of social networks.

The *stochastic blockmodel of mixed membership* in Airoldi (2006) and Airoldi *et al.* (2007a) offers an alternative approach with different insights on latent aspects underlying network structure. Models in this family also posit the existence of an unknown number of clusters; however, they replace latent positions with mixed memberships *π*_{1:N}, which map nodes to (one or more) clusters, and add a *latent blockmodel* B that specifies cluster-to-cluster hierarchical relations. These parameters are directly interpretable in terms of notions and concepts that are relevant to social scientists, and better suited to assist them in extracting substantive knowledge from noisy data, ultimately to inform or support the development of new hypotheses and theories. Therefore, inference about *π*_{1:N} and B is crucial for the analysis of data.

Applying this to Sampson's data demonstrates both linkages and differences. Our version of the Bayes information criterion also suggests the existence of three factions among the 18 monks, but our groupings are different. In Fig. 12, Romul and Victor (two of Sampson's Waverers) stand out; and so do Greg and John who were expelled first from the monastery. The mixed membership map is specified by using node-specific latent vectors *π*_{1:18}, independent and identically distributed samples from a three-dimensional symmetric Dirichlet(*α*) distribution. The map of hierarchical relationships among factions is specified by a 3×3 matrix of Bernoulli hyperparameters *B*, where *B*(*i*,*j*) is the probability that monks in the *i*th faction relate to those in the *j*th faction. Other features that are relevant to data analysis are the marginal probability of a relation and the relation between the number of clusters and dimensionality of the latent simplex.
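Under this specification the marginal probability of a tie from *i* to *j* averages the blockmodel over both nodes' membership proportions; a sketch (the *π* vectors and *B* matrix below are illustrative):

```python
import numpy as np

def tie_prob_mixed_membership(pi_i, pi_j, B):
    # Marginal probability of a tie i -> j in a mixed membership
    # blockmodel: average the faction-to-faction probabilities B(g, h)
    # over the sender's and receiver's membership proportions,
    # P(y_ij = 1) = pi_i' B pi_j.
    return float(pi_i @ B @ pi_j)
```

A monk with membership split evenly over two factions thus relates to others with a blend of both factions' tie probabilities, which is the extra flexibility that single-cluster membership cannot express.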

Our models allow a focus on issues such as membership of monks in factions, and this could lead to the formation of a social theory of failure in isolated communities, one that could be tested with longitudinal data. In Airoldi *et al.* (2007), we provide full details on specification, estimation and interpretation for both the Sampson and the adolescent friendship network examples.

**Julian Besag** (*University of Washington, Seattle*)

I would like to comment on the authors’ choice of examples. After all, social networks have been around for a long time and there is an abundance of data, so we should be expecting more than purely illustrative analyses by now.

In their first example, the authors deem an edge from *i* to *j* to exist if *i* cites *j* at *any* of his three interviews. In general, if clusters change over time, such a rule could lead to spurious results. Moreover, as regards social science, I would assume that the temporal development of clusters, including their creation, coalescence, fragmentation and destruction, is of more interest than their static properties. Although three time points are probably too few for meaningful analysis, more extensive space–time networks could have been chosen. Such analysis is particularly important for communicable diseases. Note that, in setting up space–time models, multiple changes in edge configurations can occur (almost) instantaneously, though this is sometimes overlooked.

As regards their second example, do the authors have a justification for focusing on one particular school out of 132? It seems to me that they should at least have analysed a small sample of schools. And why were no covariates included, particularly the grade of student? To claim success in extracting grade as an important clustering attribute suggests to me that the authors are too easily satisfied. Their secondary conclusions are plausible and could have been checked in other schools. The general point here is the effect of including cluster attributes as covariates, which is allowed in their original formulation but apparently not in their examples. How does this affect cluster identification?

Lastly, do the authors have anything to add about the relevance of their approach to the huge networks that for example AT&T and Microsoft researchers must deal with and for which quite different methods are used? Is this merely a computational issue or is it that exploratory techniques are more appropriate?

**David Blei** (*Princeton University*) **and Stephen E. Fienberg** (*Carnegie Mellon University, Pittsburgh*)

We congratulate Handcock, Raftery and Tantrum for this interesting and elegant paper that proposes combining the latent space and stochastic blockmodels of sociometric data. We found it especially instructive since it parallels our efforts to develop a similar analysis (Airoldi *et al.*, 2007a,b). We shall compare the two approaches.

The authors’ construction integrates the latent space model for relational data with a ‘traditional’ cluster model based on a finite mixture of Gaussian distributions. Their methodology mixes the Bayesian approach to cluster estimation with a likelihood variant of the latent space model. This is valuable for exploratory analyses of sociomatrices.

Our approach begins with a random mixed membership vector for each actor (Erosheva, 2003, 2004; Blei *et al.*, 2003). These vectors can be viewed as describing a soft clustering, where each actor belongs to multiple clusters with different proportions. The binary relationships between actors, i.e. the observed data, are mediated by per-pair latent variables, each drawn conditioned on an actor's mixed membership vector. In its general form, we allow for multiple relationships and covariates. This is a Bayesian hierarchical model.

The model that is proposed here can also be thought of as a hierarchical model, specifically when a Gaussian prior is placed on the latent position variables. In contrast, however, each actor belongs to a single cluster and the corresponding partition governs the observed relationships. There can be variance in the latent position variables, but the idea of belonging to two or more groups cannot be represented. Posterior uncertainty about cluster membership (depicted by the pie charts in the authors’ figures) is different from *mixed membership*, which carries with it an additional level of uncertainty. That said, the latent space of the authors is quite comparable with our proposed space of cluster proportions. They map actors to Euclidean space; we map actors to the simplex.

We and the authors have the same goal: infer the underlying latent structure from an observed sociomatrix. In the mixed membership model, full Markov chain Monte Carlo sampling for any but the simplest problems is unreasonably expensive. We have appealed to variational methods for a computationally efficient approximation to the posterior. These methods can scale to large matrices because of the simplified approximation (but at an unknown cost to accuracy). It would be interesting to understand computational trade-offs for the authors’ method as the sample size grows and when large numbers of covariates are added.

**Ronald Breiger** (*University of Arizona, Tucson*)

As Handcock and his colleagues refer to their model (in Section 1) as ‘a stochastic blockmodel’, and as they apply their latent position cluster model (LPCM) to a data set that was analysed much earlier by White *et al.* (1976) in their paper on blockmodels, it may be instructive to focus on the agenda that was put forward in the earlier paper and on the extent to which the new paper furthers that agenda.

As indicated in their paper's title, White *et al.* (1976) insisted on modelling social structure from ‘multiple networks’. In blockmodel analysis, partitioning of individuals is only one side of a dual problem, the other being interpretation of the pattern that is formed by that partition. The clique pattern of sociometry, which seems well generalized by transitivity, homophily on attributes and clustering, as in the LPCM model, is only one possible pattern for a blockmodel, in which some sets of actors might be understood as structurally important because they have no ties among themselves but are all tied to the same other groups. A fundamental concern was modelling the catenation of ties of different networks (such as ‘friends of advisers’).

Recent lines of research have resulted in breakthroughs in carrying forward this agenda. Generalized blockmodelling (Doreian *et al.*, 2005) permits ideal block types defining network equivalence to differ across pairs of blocks. Exponential random-graph modelling is providing a firm statistical foundation for studying catenation of ties across multiple networks (e.g. Lazega and Pattison (1999), pages 84–85). And stochastic blockmodelling, which was developed for a partition of actors specified *a priori* (Wang and Wong, 1987) or on the basis of a clustering algorithm (the models that were reviewed in Section 1 of this paper), is supplying a firm statistical foundation to replace the *ad hoc* clustering procedures that were often used in the earlier work. The LPCM is so appealing because it is based on a specifiable model of network structure (though I hope that other models will eventually also be specified), because it articulates so well with the statistical foundations of exponential random-graph modelling, and because the results are so sharp. Social networks researchers are in debt to Handcock and his colleagues for these substantial advances.

Statistical work on blockmodels has focused on partitions of actors, but not yet on specifying patterns of equivalence among blocks. Future work might address this issue along with partitions across multiple networks and those based on wider varieties of patterns of tie (such as those found in negative affect networks).

**Carter T. Butts** (*University of California, Irvine*)

Handcock, Raftery and Tantrum have ably demonstrated the potential for latent space models to address certain time-worn questions in network analysis within a modern statistical framework. One important limitation of this model, however, is that it cannot represent systematic biases in the orientation of asymmetric dyads (apart from covariate and/or activity effects). This is a simple consequence of the symmetry of |*z*_{i}−*z*_{j}| and similarly holds for the projection model of Hoff *et al.* (2002). The inability to represent orientation bias is consequential in various settings, particularly where status differences are present. A natural example with relevance to the present paper would be systems of ranked clusters (Davis and Leinhardt, 1972; Holland and Leinhardt, 1970), in which we observe multiple cohesive social groups whose intergroup connections form a partial order. Biased asymmetry is also central to the incidence of transitivity *per se* (as opposed to mere triadic clustering), a feature which the authors single out as being of particular importance.

A simple extension which would rectify this limitation is suggested by the geographical literature on flow matrices. Given a matrix *Y* of point-to-point flows, Tobler (1976, 2005) suggested decomposing the matrix into symmetric (*Y*^{+}=(*Y*+*Y*^{T})/2) and skew symmetric (*Y*^{−}=(*Y*−*Y*^{T})/2) components. The symmetric component matrix *Y*^{+} is modelled via multidimensional scaling methods, whereas the skew symmetric matrix *Y*^{−} is modelled via a *potential surface* *f*, such that *y*^{−}_{i,j} ∝ *f*(*i*) − *f*(*j*). Intuitively, overall interaction is then governed by proximity in the latent space, whereas the direction of any asymmetries is determined by relative potential (with flow tending to proceed ‘downhill’ on the potential surface).
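The decomposition itself is elementary; the following sketch (with an invented 4 × 4 flow matrix) verifies the symmetry properties that Tobler's construction relies on:

```python
import numpy as np

# Invented 4-node point-to-point flow matrix: entry [i, j] is the flow i -> j.
Y = np.array([[0., 3., 1., 0.],
              [1., 0., 2., 0.],
              [0., 2., 0., 4.],
              [2., 0., 1., 0.]])

Y_sym = (Y + Y.T) / 2.0   # symmetric component: overall interaction volume
Y_skew = (Y - Y.T) / 2.0  # skew-symmetric component: net directional flow

# The decomposition is exact, and each part has the stated symmetry.
assert np.allclose(Y_sym + Y_skew, Y)
assert np.allclose(Y_sym, Y_sym.T)
assert np.allclose(Y_skew, -Y_skew.T)
```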

Incorporation of this notion into the authors’ model is easily accomplished via modification of equation (2). Let *W* = {*w*_{i}} be a set of latent vertex potentials, with *w*_{i} ∈ ℝ^{k} ∀*i*, and let *β*_{2} be a *k*-vector of non-negative real parameters. We then posit that all edges are conditionally independent, with log-odds given by

log odds(*y*_{i,j} = 1 | *Z*, *W*, *X*) = *β*_{0}^{T}*x*_{i,j} − *β*_{1}|*z*_{i} − *z*_{j}| + *β*_{2}^{T}(*w*_{i} − *w*_{j}).

In many circumstances, it seems reasonable to assume *k*=1 (i.e. a single-status ordering). *k*>1 is possible, however, if multiple status dimensions are active within the network. The prior structure for *W* may be constructed analogously to that of *Z*, although the set of invariances is somewhat more restricted. The addition of vertex potentials to the latent space model is thus a very simple extension, but one which corrects a consequential limitation of the present approach.
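As a concrete sketch of this extension (all positions, potentials and coefficients below are invented for illustration), the modified log-odds with a one-dimensional potential (*k* = 1) could be computed as:

```python
import numpy as np

def edge_log_odds(x_ij, z_i, z_j, w_i, w_j, beta0, beta1, beta2):
    """Log-odds of an i -> j tie under the sketched extension:
    beta0' x_ij - beta1 * |z_i - z_j| + beta2' (w_i - w_j)."""
    return (np.dot(beta0, x_ij)
            - beta1 * np.linalg.norm(z_i - z_j)
            + np.dot(beta2, w_i - w_j))

# Two nodes at the same latent distance, but i has the higher potential,
# so the i -> j tie is more likely than the j -> i tie.
z_i, z_j = np.array([0., 0.]), np.array([1., 0.])
w_i, w_j = np.array([2.0]), np.array([0.5])
x = np.array([1.0])  # intercept-only covariate
beta0, beta1, beta2 = np.array([0.5]), 1.0, np.array([0.8])
eta_ij = edge_log_odds(x, z_i, z_j, w_i, w_j, beta0, beta1, beta2)
eta_ji = edge_log_odds(x, z_j, z_i, w_j, w_i, beta0, beta1, beta2)
assert eta_ij > eta_ji  # asymmetry is driven purely by the potentials
```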

**Patrick Doreian** (*University of Pittsburgh*) **and Vladimir Batagelj and Anuška Ferligoj** (*University of Ljubljana*)

The paper offers an intriguing approach to partitioning networks where the goal is to partition the vertices. Our comment points to an alternative approach: one that we hope is compatible with theirs.

Generalized blockmodelling (Doreian *et al.*, 2005) has a primary goal of discerning the structure of a network via homomorphisms of the network to simpler images. This entails explicitly partitioning both the vertices into clusters (called *positions*) and the relational ties into *blocks*. Blocks are specified by predicates that are used to characterize permitted block types that describe structure. Sets of predicates correspond to specific equivalences. For example, null and complete blocks correspond to structural equivalence whereas null and one-covered blocks correspond to regular equivalence.

The pattern of ideal blocks in the image characterizes the structure of the image matrix which, in turn, describes the underlying structure of the empirical network. Specifying a blockmodel can range from specifying only the permitted block types to specifying a block type for every location in a blockmodel. Given a specified blockmodel, empirical blockmodels are identified by a local optimization clustering algorithm that minimizes a criterion function: one that must be compatible with, and sensitive to, the equivalence that is defined for the specified blockmodel.

In the paper, transitivity is the driving structural feature. However, many blockmodels are consistent with transitivity. These include complete diagonal blocks with null blocks elsewhere, a complete upper triangular network with null blocks elsewhere (dominance structures) and complete diagonal blocks, for positions *i* and *j* with the upper (*i*,*j*) block complete and null blocks elsewhere. This suggests that transitivity is ambiguous with regard to block structures and incomplete for specifying network structure.

The likely structure of a high school network is one with denser patches of ties within grades. Six grades suggest six positions and a blockmodel structure of diag(den) where a density threshold is set. The off-diagonal blocks are null. This specification has the unique partition that is shown in Fig. 13. We use grades to label vertices and undirected lines for reciprocated ties. The overlap between the grades and the clusters is shown in Table 5. There is consistency between their partition and ours. Both are readily interpretable.

Table 5. Partition clusters and grade levels

| *Grade* | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
| --- | --- | --- | --- | --- | --- | --- |
| 7 | 13 | 1 | 0 | 0 | 0 | 0 |
| 8 | 0 | 10 | 2 | 0 | 0 | 0 |
| 9 | 0 | 0 | 10 | 0 | 0 | 6 |
| 10 | 0 | 0 | 0 | 10 | 0 | 0 |
| 11 | 0 | 0 | 1 | 0 | 11 | 1 |
| 12 | 0 | 0 | 0 | 0 | 0 | 4 |

Our approach is computationally simpler and also explicitly describes network structures. The appeal of the statistical approach includes the estimation of *k* and having an inferential foundation. It would be nice to couple these approaches to the benefit of both.
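The diag(den) specification can be checked directly from an adjacency matrix and a vertex partition. Here is a minimal sketch with an invented five-vertex network; the function name and threshold are ours, not part of the generalized blockmodelling software:

```python
import numpy as np

def block_densities(A, labels):
    """Density of ties within and between clusters of an undirected
    binary adjacency matrix A, given a cluster label for each vertex."""
    labels = np.asarray(labels)
    ks = sorted(set(labels))
    D = np.zeros((len(ks), len(ks)))
    for a, p in enumerate(ks):
        for b, q in enumerate(ks):
            rows, cols = labels == p, labels == q
            block = A[np.ix_(rows, cols)]
            if p == q:
                n = int(rows.sum())
                possible = n * (n - 1)  # off-diagonal cells only
                D[a, b] = block.sum() / possible if possible else 0.0
            else:
                D[a, b] = block.mean()
    return D

# Tiny invented network: two grades, dense within, near-null between.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]])
grades = [7, 7, 7, 8, 8]
D = block_densities(A, grades)
# Diagonal blocks exceed a density threshold; off-diagonal blocks are near null.
assert D[0, 0] > 0.5 and D[1, 1] > 0.5 and D[0, 1] < 0.2
```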

**David Draper** (*University of California, Santa Cruz*)

This excellent paper provides a nice example of contemporary likelihood and Bayesian analysis in an interesting social sciences setting; I have a comment about validation of the methods proposed. There are two main ways to evaluate the quality of a statistical method: *process* (do the assumptions on which the method is based seem reasonable?) and *outcome* (when you know what the right answer is, does the method tend to give you back known truth?). Of these two approaches, outcome evaluations are generally stronger than process assessments. For me, the methods of this paper pass a judgmental process test quite well (with one question; see the comment by Mendes and Draper). Regarding outcome, in their first example the authors seem happy when their Bayesian fitting method reproduces the cluster structure previously identified by the researcher who collected the data, and they mention in further support of the idea that the Bayesian result is ‘good’ that ‘Overall the Bayesian estimate of the latent position cluster model produces greater distinctions between the groups …’. Of course these are not true outcome evaluations, because there is no comparison with known truth; in a sense they are more like another kind of process evaluation (do the results produced by the method seem reasonable?). The validation story appears a little stronger in the authors’ second example, where the Bayesian fitting method by and large succeeds in inferring what grade the students were in without using that information in the fitting process; the authors are less happy to find that the maximum likelihood approach chose only two clusters, but this ignores the possibility that the dominant clustering is not by grade but by the (perhaps even stronger) distinction in the American schooling system between middle school and high school (indeed, Fig. 4 supports this two-cluster ‘explanation’, with the ninth graders occupying a transitional role between middle and high school; the authors say that they ‘consider a single school of 71 adolescents from grades 7–12’, but it is rare in the USA for all six of those grades to be taught in the same building). In the absence of known truth, both the two-cluster and the six-cluster solutions seem plausible, and plausibility (not validity) is the strongest conclusion one can claim. A more convincing validation exercise would involve

- (a)
finding a social network situation in which the actors perceive themselves as members of explicit social clusters,

- (b)
eliciting from each actor two kinds of information—the relational ties (e.g. ‘I'm friends with persons *a* and *g*’) and a form of personal ‘truth’ (e.g. answers to questions like ‘I identify myself as a member of cluster *X*’)—and

- (c)
trying to infer the personal truth from the relational tie information without using the former in the inferential process.

Has anyone tried this form of validation in the field of social networks?

**Marijtje A. J. van Duijn** (*University of Groningen*)

I congratulate the authors on proposing—and making available through accessible software—a very interesting network model that incorporates some important concepts from social network analysis.

The inclusion of transitivity (and reciprocity, in the case of directed relations) through latent positions is interesting, where cluster membership encompasses these (and possibly more) structural effects through the use of spatial proximity. In the two applications that were presented in the paper, the definition of space in two dimensions seems adequate. The authors, however, do not make any suggestion for the interpretation of these dimensions. It might be interesting to investigate whether these dimensions are related to other network or actor characteristics that are not included in the model, in the same way that the clusters in the applications were found to correspond—in varying degree—to known attributes. A logical next step would be to include these attributes in the model. I wonder about a possible trade-off between interpretability of clusters and model specification.

The random sender and receiver effects of the *p*_{2}-model (van Duijn *et al.* (2004), with accompanying software available at http://stat.gamma.rug.nl/stocnet) could be considered to define a latent space with a clear interpretation. Unlike the latent position cluster model and the earlier latent distance models (Hoff *et al.*, 2002; Hoff, 2005; Shortreed *et al.*, 2006), the *p*_{2}-model uses the dyadic outcome as dependent variable and thus explicitly incorporates the tendency of reciprocity within dyads. Its focus is on (fixed) actor and dyadic attribute (homophily) effects, and the model does not take into account transitivity or other triadic structural effects; nor does it consider the spatial representation of the network.

Model selection seems to be a topic requiring further investigation, in latent space models, and in the *p*_{2}-model (Zijlstra *et al.*, 2005). The somewhat heuristic Bayes information criterion approximation for the latent position cluster model seems to work quite well and is supported by the recent application of the Bayes information Monte Carlo criterion BICM (Raftery *et al.*, 2007). First results with BICM as a model selection criterion in the *p*_{2}-model are encouraging.

**Katherine Faust and Miruna Petrescu-Prahova** (*University of California, Irvine*)

Handcock, Raftery and Tantrum should be commended for presenting a principled basis for network scaling and node clustering. Our comments situate the latent position cluster model in relation to other social network analysis approaches and point to comparisons that facilitate interpretation.

The latent position cluster model contributes to a venerable tradition in social network analysis: combining spatial representation of social proximity with node clustering to identify subgroupings of actors. This combination often gives rich insight into social network structure, as seen in the authors’ monastery and adolescent friendship examples. The latent position cluster model improves on extant methods by providing a principled way to determine dimensionality and number of clusters. It gives a model-based approach for network visualization with a precisely defined relationship between node distances and network ties. Combining clusters with positions shows internal differentiation within clusters and proximity of clusters relative to each other. These are valuable advances over many standard network methods for visualization and subgroup detection.

Regarding comparison of the two approaches, the authors observe that

‘… Bayesian estimate of the latent position cluster model produces greater distinctions between the groups than the two-stage estimate…’

(page 311). Indeed, clusters appear to be more easily distinguishable in Bayesian estimates. This is due to longer distances between cluster averages, but mostly to lower within-cluster variability in node positions, a point that is obscured by differently scaled axes in Figs 1 and 3. We refitted the two-dimensional, three-cluster model to the monastery data by using both approaches and display results in Fig. 14. Mean within-cluster distances for two-stage estimates are 1.164, 0.911 and 0.642, and for Bayesian estimates are 0.531, 0.317 and 0.279, for clusters 1, 2 and 3 respectively. Clearly, clusters from Bayesian estimates are more compact than are clusters from two-stage maximum likelihood estimates.
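The within-cluster compactness comparison can be reproduced for any set of estimated positions; a minimal sketch with invented latent positions (not the monastery estimates):

```python
import numpy as np

def mean_within_cluster_distance(Z, labels):
    """Mean pairwise Euclidean distance among the latent positions
    assigned to each cluster."""
    labels = np.asarray(labels)
    out = {}
    for k in sorted(set(labels)):
        pts = Z[labels == k]
        dists = [np.linalg.norm(pts[i] - pts[j])
                 for i in range(len(pts)) for j in range(i + 1, len(pts))]
        out[k] = float(np.mean(dists)) if dists else 0.0
    return out

# Illustrative positions: cluster 1 is cluster 2 scaled by a factor of 2,
# so its mean within-cluster distance is exactly twice as large.
Z = np.array([[0., 0.], [2., 0.], [0., 2.],
              [5., 5.], [6., 5.], [5., 6.]])
labels = [1, 1, 1, 2, 2, 2]
d = mean_within_cluster_distance(Z, labels)
assert abs(d[1] - 2 * d[2]) < 1e-9
```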

Assessing dimensionality is a valuable feature of the model that is not fully exploited in the paper. With regard to the monastery data, greater variability on the horizontal than vertical axis in Fig. 3, the apparent ‘horseshoe’ configuration, and only three arcs between clusters 1 (on the left) and 3 (on the right) suggest the possibility of a one-dimensional solution. The one-dimensional solution has BIC=−360.6542 compared with BIC=−305.8171 for two dimensions, and so it is not appropriate for these data.

**Jonathan J. Forster** (*University of Southampton*)

I have two questions concerning this interesting and stimulating paper. The main attraction of using the Bayes information criterion (BIC) in model comparison is that it can often be calculated by using outputs of standard packages. Given that the authors have already devoted considerable computational effort to carefully calculating or simulating posterior distributions, I wondered whether they had considered also more accurately approximating the marginal likelihood. It strikes me that the extra effort that is involved would be relatively little, and it would circumvent the difficulties in choosing a suitable *n* for the BIC formula (I do not find the argument for using the actual number of ties for the logistic regression BIC to be all that compelling). More generally, the paper considers spherically distributed clusters in a two-dimensional latent space. Could the authors give any insight into the benefits or problems that are associated with relaxing either of these assumptions?

**Andrew Gelman** (*Columbia University, New York*)

Social networks are important for their own sake and for their role in propagating phenomena such as political polarization. In a world full of disputes between and within nations, it is particularly important to have tools for studying the latent connectedness between people with disagreements and even hatreds, but who might be more tolerant of each other if they knew what connections they had in common.

I have little to add to the model or the statistical analysis except to point to the work of Watts *et al.* (2002), who noted that the social network is actually a union (i.e. overlapping) of networks from family, friends, church, work and so forth. Ideally, I think that a model of the social network would model these separate components. Along with this is the notion that networks evolve dynamically, with processes such as the completion of open triangles (if Ann knows Bob, and Bob knows Carl, then Ann is likely to meet Carl at some point); see, for example, Kossinets and Watts (2006). Perhaps the model of Handcock and his colleagues can be generalized to allow this time component (with the time points treated as latent data if they are not observed).

Finally, I encourage the researchers to think harder about how to present numbers such as those in Tables 1 and 3; for example, should we care that the estimate of *β*_{0} under a particular model is ‘3.475’? For future work in this area, I recommend thinking carefully about what comparisons are of interest and then presenting the results graphically to learn about these comparisons (see Gelman *et al.* (2002)).

**Steven M. Goodreau** (*University of Washington, Seattle*)

The authors’ work has many potentially important applications, of which two stand out for me. One is as an exploratory mechanism for understanding cases in which subpopulations that are defined by exogenous attributes are expected to form cohesive groups but do not. For example, in on-going work, colleagues and I are examining 59 of the school groups in the adolescent health study (the same study from which the authors draw a simple example) and are discovering that some forms of homophily and transitivity are both of universal importance. However, some groups display much more heterogeneity in their cohesiveness than others, most notably Hispanics and, to a lesser extent, native Americans and Asian Americans. We have considered several reasonable predictors for when these groups do or do not exhibit dyadic and triadic level cohesion, with limited success. The work that is presented in this paper could provide a novel method for distinguishing between the multiple ways that such groups can fail to be cohesive through analysis of both latent clusters and positions. Do they form more than one distinct subgroup, each of which is itself relatively cohesive? Or do different members of the population cluster tightly with other subgroups? Is there a large amount of uncertainty in cluster membership for some subset of actors? The latent position cluster model should allow us to distinguish between these possibilities in a statistically grounded way.

A second, and perhaps more widely applicable, use of these models could be as a tool for exploring goodness of fit. Social network modellers are often interested in knowing when their models have managed to capture all of the relevant structure in a network, but traditional goodness-of-fit measures often do not provide a clear answer to this question. Although recent advances have been made in this area (e.g. Hunter *et al.* (2007)), such approaches necessarily require decisions about which particular features of network structure are important for a well fitting model to capture. The current approach would seem to provide an additional general method: adding the latent position cluster model as a ‘residual’ to a structural model to identify any remaining structure or clustering. Conceptually, such an approach would not only tell us whether such structure remained but also provide a sense of its nature. This would be an important addition to our toolkit for assessing model fit, which is so far an underexplored area of network analysis.

**Priscilla E. Greenwood** (*Arizona State University, Tempe*)

The beauty of the latent position cluster model is that no spatial setting is introduced. The setting is completely abstract, which means that the data set freely introduces its own structure via the inference step. The basic idea seems extremely flexible and can be applied with various regression models instead of model (2), and with other multivariate cluster–shape distributions replacing model (3).

The authors mention a possible epidemic interpretation. Suppose that a contagious disease runs its course in a community, and each individual tells us from whom he contracted the ailment. The method that is presented here can be used to infer the cluster structure of the epidemic. Epidemic parameters can be estimated simultaneously. This will be an interesting tool in spatial epidemic theory. Although epidemic data are rarely available in the form of directed ties, the method can be used as a simulation step. This example suggests two natural extensions of the latent position cluster model.

Suppose that we add a time structure to the model. Then, since the data give directed ties between nodes, we shall be able to infer a time ordering along the paths of the graph that is formed by these links, even though the directions in the data will not always be consistent. The result would give information about the evolutionary path of the epidemic through the community. Let us consider a genomic context. Inference about clusters together with time ordering, in a graph that is constructed from an alignment of homologous genes from several species, would produce a postulated phylogenetic structure.

A second natural extension would be to use the degree aspect of the data, the number of ties coming from each node, as an ingredient in the estimation of the number of clusters. In the epidemic context the degree data could be used for inference about the contagion parameter, either as a constant over the graph or locally within the clusters.

**Katharina Gruenberg and Brian Francis** (*Lancaster University*)

We congratulate the authors for having written such an inspiring paper. We would, however, like to point out two possible extensions which may arise out of real life data. The first relates to directed links. It is possible to measure links on a scale that allows positive as well as negative values for sociomatrices. Allowing for negative values permits the simultaneous modelling of ‘like’, ‘dislike’ and ‘indifference’. For instance, with the monk data we could investigate whether there are no ties between the ‘Outcasts’ and the ‘Loyal Opposition’ or whether their relationship is in fact one of possible mutual dislike. Alternatively, the same model could be used to model the reciprocation of dislike. The second point refers to the nature of social networks—homophily does not always exist in networks. People may be attracted to each other if they exhibit opposing ideas—the idea that opposites attract. Such relationships might exist in prison. Adapting the model to allow both for ‘opposites attracting’ and ‘similars attracting’ is a new challenge.

**Christian Hennig** (*University College London*)

I congratulate the authors for this very stimulating paper. I would like to contribute some thoughts about the model assumptions.

- (a)
The authors discuss the transitivity that is imposed by their model, particularly by the triangle inequality which is assumed to hold in the latent social space, but they do not explain how to check this model assumption. One possibility could be to apply a parametric bootstrap, i.e. to simulate new data sets from the fitted model and to compare the bootstrap distribution of the number of ties in triads with its observed value.

- (b)
It could be useful in some situations to allow more general covariance matrices within clusters. This enables, for example, elongated clusters, which have the reasonable social interpretation of modelling a group as spreading between two extreme points, which is not captured by spherical clusters.

- (c)
It could be useful to include the so-called ‘noise component’ as mentioned in Fraley and Raftery (1998) in the cluster model, because individuals who do not belong to any cluster may be found in many social networks.
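The parametric bootstrap proposed in point (a) could be sketched as follows; the latent positions, coefficients and observed triangle count here are all invented for illustration rather than taken from a real fit:

```python
import numpy as np

rng = np.random.default_rng(0)

def triangle_count(A):
    # Number of closed triads in an undirected binary graph.
    return int(np.trace(A @ A @ A) // 6)

def simulate_network(Z, beta0, beta1, rng):
    """Simulate one undirected network from a fitted latent distance model:
    logit P(y_ij = 1) = beta0 - beta1 * |z_i - z_j|."""
    n = len(Z)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            eta = beta0 - beta1 * np.linalg.norm(Z[i] - Z[j])
            p = 1.0 / (1.0 + np.exp(-eta))
            A[i, j] = A[j, i] = rng.random() < p
    return A

# Hypothetical 'fitted' positions and coefficients.
Z = rng.normal(size=(20, 2))
boot = [triangle_count(simulate_network(Z, 1.0, 2.0, rng)) for _ in range(200)]
observed = 25  # hypothetical observed triangle count
p_value = float(np.mean([b >= observed for b in boot]))  # one-sided check
```

Comparing the observed count with the bootstrap distribution then indicates whether the fitted model reproduces the triadic structure of the data.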

**Peter D. Hoff** (*University of Washington, Seattle*)

Latent variable models of social networks can be motivated in a natural way: for undirected data without covariates we might view the nodes of a social network as being exchangeable, so that

Pr(*Y*_{i,j} = *y*_{i,j} for all *i* < *j*) = Pr(*Y*_{π(i),π(j)} = *y*_{i,j} for all *i* < *j*)

for any permutation *π* of the node labels. Aldous (1985) has shown that all such data can be expressed as *Y*_{i,j}=*g*(*μ*,*z*_{i},*z*_{j},*ɛ*_{i,j}), where *g* is symmetric in its second and third arguments. Thus, the variation in any exchangeable sociomatrix can be represented with node-specific latent variables {*z*_{1},…,*z*_{n}} and pair-specific noise {*ɛ*_{i,j}}. Bayesian estimation for the stochastic blockmodel of Nowicki and Snijders (2001) and the latent position model of Hoff *et al.* (2002) can be viewed as special cases of this general latent variable model. In the former, the *z*s are latent classes and *g* maps pairs of classes to between-class interaction rates. In the latter, the *z*s are vectors and *g* involves the Euclidean distance between them.

The stochastic blockmodel and the latent position model represent extremes of simplicity and complexity: the stochastic blockmodel implies that, conditionally on the values of the *z*s, all nodes within a common class share the same distribution over relationships. In contrast, the standard latent position model gives a different distribution for every node. The latent position cluster model that was presented by Handcock, Raftery and Tantrum nicely fills a void between the two approaches: a set of similarly acting, well-connected nodes will be identified and represented as a tight cluster, whereas nodes with unique behaviours will not be forced into ill-fitting groups.

As described in the paper, latent position models represent homophily. But this homophily is confounded with stochastic equivalence: similar values of *z*_{i} and *z*_{j} imply that *i* and *j* are likely to have a tie (since |*z*_{i}−*z*_{j}| is small) and also have similar relationships to other nodes (since |*z*_{i}−*z*_{k}|≈|*z*_{j}−*z*_{k}|). This correspondence is often present in friendship networks, but absent in networks such as the World Wide Web, in which ‘hubs’ connect to similar groups of nodes but not to each other. To separate homophily from stochastic equivalence we might consider an ‘eigenvalue decomposition’ model as described by Hoff (2006), in which the probability of a link between *i* and *j* is related to the bilinear form *z*_{i}^{T}Λ*z*_{j}. By allowing entries of Λ to be either positive or negative, such a model can exhibit stochastic equivalence with or without homophily.
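A tiny numerical sketch of that point, with invented one-dimensional positions and Λ = −1: two ‘hubs’ at the same latent position obtain a low log-odds of a tie with each other but a high log-odds with a common third node, i.e. stochastic equivalence without homophily:

```python
import numpy as np

# Bilinear-form sketch: the link log-odds between i and j is taken to be
# related to z_i' Lambda z_j (the eigendecomposition idea, not the
# distance |z_i - z_j| of the latent position cluster model).
Lam = np.diag([-1.0])          # a single negative 'eigenvalue'
z_hub1 = np.array([2.0])
z_hub2 = np.array([2.0])       # same latent position as hub 1
z_leaf = np.array([-1.0])

def bilinear(zi, zj):
    return float(zi @ Lam @ zj)

assert bilinear(z_hub1, z_hub2) < 0  # similar nodes: tie is *unlikely*
assert bilinear(z_hub1, z_leaf) > 0  # hub-leaf tie is likely
assert bilinear(z_hub2, z_leaf) > 0  # both hubs link to the same leaf
```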

**David R. Hunter** (*Penn State University, University Park*)

The paper by Handcock and his colleagues provides an interesting and important extension of the latent space model of Hoff *et al.* (2002). A different extension of this work—which may also be applied to the current paper—allows for more explicit modelling of local network features, such as transitivity, by using an exponential random-graph model (ERGM).

If the matrix **y** denotes the entire network (i.e. the collection of all *y*_{i,j}), then equation (2) implies that

- *P*(*Y*=**y**|*Z*,*X*) = exp{*β*_{0}^{T} *g*(**y**,*X*) + *β*_{1} *h*(**y**,*Z*)}/*κ*(*β*_{0},*β*_{1})  (11)

where *κ*(*β*_{0},*β*_{1}) is a normalizing constant.

Conditionally on the latent positions *Z*, the resulting model,

- *P*(*Y*=**y**|*Z*,*X*) = exp{*β*_{0}^{T} *g*(**y**,*X*) + *β*_{1} *h*(**y**,*Z*)}/*κ*(*β*_{0},*β*_{1})  (13)

is evidently a canonical exponential family (see, for example, Lehmann (1983)) of distributions parameterized by (*β*_{0},*β*_{1}) with statistics *g*(**y**,*X*) and *h*(**y**,*Z*). Therefore, conditionally on *Z*, model (13) is an ERGM (‘graph’ here is a synonym for ‘network’). Snijders (2002) and Robins *et al.* (2007a) give literature reviews of these models, which are also called *p*-star models in the literature.

Importantly, model (13) is still an ERGM, conditional on *Z*, if the vector *g*(**y**,*X*) of network statistics of interest is not of the form (12) that allows the likelihood function to factor nicely as in equation (11). The simplest such ‘non-factoring’ models were considered by Frank and Strauss (1986), in which *g*(**y**,*X*) contained terms such as the number of triangles in **y**, Σ_{i<j<k} *y*_{i,j}*y*_{j,k}*y*_{i,k}. Much recent work in the social networks literature has focused on development of useful statistics *g*(**y**,*X*) for modelling real network data (Snijders *et al.*, 2006; Robins *et al.*, 2007b), as well as explaining why some statistics, such as the number of triangles, lead to ERGMs that fail miserably at modelling these data (Handcock, 2002, 2003).
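As an illustration (our own sketch, with an invented toy graph), the triangle statistic can be computed directly from an undirected 0/1 adjacency matrix:

```python
import numpy as np

# The "non-factoring" triangle statistic sum_{i<j<k} y_ij y_jk y_ik.
def triangle_count(Y):
    A = np.asarray(Y)
    # For a symmetric 0/1 matrix with zero diagonal, trace(A^3) counts
    # each triangle six times (three starting nodes, two directions).
    return int(np.trace(A @ A @ A)) // 6

# Four nodes: one triangle 0-1-2 plus a pendant edge 2-3.
Y = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    Y[i, j] = Y[j, i] = 1
# triangle_count(Y) -> 1
```

The statistic couples every pair of edges that share a node, which is exactly why the likelihood no longer factors over dyads.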

Model (13) would give the modeller a powerful tool for exploring network structure: for instance, if the latent positions and cluster assignments of the nodes change dramatically on the introduction of a particular network statistic into the ERGM, this suggests that the statistic captures an important aspect of network structure. Yet estimating parameters in a model such as equation (13) is quite difficult when *g*(**y**,*X*) is not of the form (12). In principle, the two-stage maximum likelihood estimation of Handcock and his colleagues should work, though the second stage would rely on a stochastic algorithm that is based on Markov chain Monte Carlo simulations such as those described by Hunter and Handcock (2006) or Snijders (2002). The Bayesian scheme that is implemented here is promising, but establishing a reasonable prior for the ERGM parameter *β*_{0} is difficult. Despite the remaining challenges, this paper is a real step forwards.

**Dirk Husmeier and Chris Glasbey** (*Biomathematics and Statistics Scotland, Edinburgh*)

The authors have contributed an intriguing and stimulating paper to the growing literature on the statistical analysis of network structures. The model also provides a tool for visualizing networks, beyond existing visualization tools such as Cytoscape (Shannon *et al.*, 2003).

Networks are of burgeoning interest in many fields, not least in post-genomic biology (see, for example, Wang and Chen (2003) and Milo *et al.* (2004)). Some biological interaction networks violate the underlying model assumptions, though. For instance, transcription factors regulating sets of unconnected genes and non-directly interacting proteins bound by the same protein recognition modules both lead to a violation of the transitivity condition. For this reason, model diagnostics would be a welcome addition to the work. Given that the authors have proposed a probabilistic generative approach, the application of diagnostics such as Bayesian *p*-values should be straightforward.

The model proposed could, in principle, contribute to post-genomic data integration. Consider, for instance, a situation where protein interactions inferred from yeast two-hybrid experiments are complemented by ribonucleic acid concentrations from transcriptional profiling with microarrays. The model allows us to infer the intrinsic trade-off between these two noisy and disparate data sets via equation (2), by treating the ribonucleic acid profiles as covariates and weighting their influence against the protein interactions via the two hyperparameters *β*_{0} and *β*_{1}.

The authors approach the inference problem in terms of a hierarchical Bayesian model, sampling parameters from the posterior distribution with a Gibbs and Metropolis-within-Gibbs scheme, which is sound. Less sound, however, is inference on the number of clusters. The marginalization of the likelihood is carried out with respect to the parameters, but not the latent variables (i.e. the *Z*_{i}s). Also, the Bayes information criterion approximation in equation (10) is rather restrictive. The Bayes information criterion assumes that the posterior distribution is multivariate Gaussian, ignoring differences in the eigenvalues of the covariance matrix, and the approach is hence compromised to the extent that this assumption is violated. Although a full reversible jump Markov chain Monte Carlo scheme might be computationally prohibitive, variational methods, which are currently very popular in the machine learning community, would presumably provide a much better approximation to the integration and might therefore provide a promising avenue for future research.
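The penalization that is at issue can be sketched for a single spherical Gaussian component (our own minimal sketch of a BIC-style score, 2 log-likelihood minus *p* log *n*, not the authors' code; the criterion in the paper applies the same idea across candidate mixture models):

```python
import numpy as np

# BIC-style score for one spherical Gaussian fitted to latent positions Z.
def bic_one_gaussian(Z):
    Z = np.asarray(Z, dtype=float)
    n, d = Z.shape
    mu = Z.mean(axis=0)
    var = ((Z - mu) ** 2).mean()                    # spherical MLE variance
    loglik = -0.5 * n * d * (np.log(2 * np.pi * var) + 1.0)
    p = d + 1                                       # mean vector + one variance
    return 2.0 * loglik - p * np.log(n)
```

Tightly clustered positions score higher than the same positions scaled outwards, since the fitted variance enters the likelihood directly; the Gaussian-posterior assumption that the discussants criticize is baked into the penalty term.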

**David Krackhardt** (*Carnegie Mellon University, Pittsburgh*)

First, I want to emphasize the importance of the problem that is addressed by this paper. Cleanly identifying clusters of actors in a social system on the basis of their social ties is an age-old pursuit of generations of scholars, from sociologists and psychologists to mathematicians (e.g. Luce and Perry (1949) and Cartwright and Harary (1956)). UCINET (Borgatti *et al.*, 2002), which is the most commonly used package for analysis of network data, has 20 distinct methods for finding clusters or groups, each with a plethora of suboptions and parameter choices which, depending on the data, may yield wildly differing results. This dizzying array of ‘solutions’ raises the central question: given the observed data, what is the right number of clusters and what is their composition? Using the Bayes factors approach to answer this critical question statistically is a major step forwards out of this intellectual morass.

The paper quickly leads me to ask a couple of extending questions. First, how sensitive is this procedure to violations of the assumption on independence of dyadic observations? We know that even moderate amounts of ‘network autocorrelation’ in the data can dramatically affect estimates of standard errors and concomitant inference tests in traditional analytic procedures (Krackhardt, 1988).

Second, should we rely on empirical demonstrations of the model to provide us with evidence that the procedure is uncovering the true, underlying group structure? The fact that the procedure recovers the same structure in the Sampson data as other prior analyses could be because the networks are so clearly clustered that it does not matter what hammer you use to pound the data; they will always reveal the same story. In the case of ties between adolescents, the fact that their method cleanly shows discrimination between grades is interesting, but does that mean it was more accurate? Suppose that the result had not fallen along grade lines. Would that mean that the method was not accurately assessing real underlying clusters? Or, would it mean instead that networks were clustering on some other criteria?

Both of these questions could be addressed with appropriate Monte Carlo simulations. The advantage of such simulations is that you have control over ‘truth’, and by adding precise, known, and yet complex structures of noise, we can directly assess how well the proposed Bayesian method recaptures this underlying truth. Such simulations would help us to delineate the boundary conditions within which their method is truly powerful.
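Such a simulation can be set up in a few lines (a hedged sketch of one possible design; the two-block structure, probabilities and noise mechanism are all invented for illustration):

```python
import numpy as np

# Planted two-block digraph with a `noise` knob that washes out the truth.
rng = np.random.default_rng(1)

def planted_partition(n=20, p_in=0.8, p_out=0.1, noise=0.0):
    labels = np.repeat([0, 1], n // 2)             # the known "truth"
    P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    P = (1.0 - noise) * P + noise * 0.5            # noise flattens the structure
    Y = (rng.random((n, n)) < P).astype(int)       # directed ties
    np.fill_diagonal(Y, 0)
    return Y, labels

Y, truth = planted_partition(noise=0.2)
# Sweeping `noise` over [0, 1] and comparing the recovered clusters with
# `truth` (e.g. by a Rand index) maps out where a method remains reliable.
```

Because the generating labels are known, recovery can be scored exactly, which is precisely the control over ‘truth’ that the comment calls for.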

**Jouni Kuha and Anders Skrondal** (*London School of Economics and Political Science*)

We would like to query the authors’ decision to advocate a method of model selection which is incoherent with their Bayesian estimation approach. In Section 4, they propose choosing the number of clusters by using the Bayes information criterion (BIC) statistic, calculated conditionally on posterior estimates of the latent positions. This uses maximum likelihood rather than maximum posterior estimates of the parameters and implicit prior distributions which differ from the priors that are specified in Section 3.2. It is thus only the estimated positions which actually depend on the results of the Bayesian estimation.

Would different results be obtained from an approach which was more consistent with the specified Bayesian model? To examine this, we generated 18 latent positions *Z* (scaled to have unit root mean square, as in Section 2) from a three-cluster model with parameters equal to the posterior medians in Table 1. Conditionally on these *Z*, we calculated, for models of 1–4 clusters, two approximations of 2 log {*P*(*Z*)}: the Laplace approximation (see equation (4) of Kass and Raftery (1995)) and the rougher BIC statistic of Section 4. The former depends on the posterior mode of the parameters and on the prior distributions that are actually used for the Bayesian estimation. For simplicity the values of *Z* were here the same for each model. In this case, *P*(*Y*|*Z*) does not depend on the number of clusters, so model comparison is based on *P*(*Z*) only.

Fig. 15 shows the values of the BIC and Laplace approximations of 2 log {*P*(*Z*)}. Here both statistics correctly select the three-cluster model but there are some striking differences between them. The Laplace approximation imposes a larger penalty on the log-likelihood, so the prior distributions that are specified in Section 3.2 are actually less informative than the unit information priors that are used for the BIC. The effect of this is to increase the posterior probabilities of simpler models compared with the three-cluster model. In general, this means that a Bayesian model selection criterion based on the prior distributions of Section 3.2 may choose a model with fewer clusters than the BIC that is proposed in Section 4.

Having already used the Markov chain Monte Carlo machinery for Bayesian estimation, it would be natural to obtain direct estimates of Bayes factors as a by-product (without conditioning on the estimated latent positions), instead of employing rough approximations of them. It seems plausible that a coherent approach that is based directly on Bayes factors would often favour smaller numbers of clusters than the approach which is considered in the paper.

**Andrew Lawson** (*University of South Carolina, Columbia*)

This paper is a very interesting example of the application of cluster modelling to a network domain. I have a few comments on the work.

First, the authors make a very strong parametric assumption about the latent positions in the social space in that

*z*_{i} ∼ Σ_{g=1}^{G} *λ*_{g} MVN_{d}(*μ*_{g}, *σ*_{g}^{2}*I*_{d}).

Here, the cluster form is forced to be symmetric multivariate normal around a mean vector *μ*_{g}. They also assume that the components are independent. In the subsequent model fitting, these assumptions appear to be unchallenged and yet it could easily be argued that there is a need for asymmetry and irregularity in social spaces. In other clustering applications this is not enforced (see for example Kim and Mallick (2002)). I am aware that the mclust software makes such assumptions, and so this affects the convenient implementation of the model. Have the authors considered relaxing these assumptions or examined the sensitivity of the model to these parametric restrictions?

Second, in the Bayesian model that is described in Section 3.2 the authors appear to fix certain parameters. For example, the *β*-parameters have fixed and relatively narrow variances, whereas in many Bayesian regression contexts these would have hyperpriors. This fixing of the hierarchy could lead to differences in estimates. Another example is the use of fixed variance priors for the mean parameters. Overall, this hierarchy truncation could be significant. Can the authors comment on the need for such truncation in their formulation? Does the implementation depend on this truncation?

Third, with regard to reversible jump and fixed *G*, the authors appear to avoid the idea of reversible jump sampling to allow for a dimension change in *G*. Indeed they do not even discuss the possibility. In addition, it is not clear from the paper whether *G* is fixed or allowed to vary. Clearly it would be feasible to assign a prior distribution for *G* and to sample it. Another possibility would be to formulate a binary variable selection model where

*z*_{i} ∼ Σ_{g=1}^{G} *ψ*_{g}*λ*_{g} MVN_{d}(*μ*_{g}, *σ*_{g}^{2}*I*_{d}),

with *ψ*_{g} a Bernoulli selection variable which can be sampled. These choices appear to be simpler than the approach that is advocated by the authors.

**Tim F. Liao** (*University of Illinois, Urbana*)

I congratulate Handcock, Raftery and Tantrum for their contribution to the statistical literature on social network analysis. The model-based approach to analysing social network data represents a great leap forward: the method effectively overcomes the major disadvantage of previous methods where clique or cluster memberships are known, an assumption that is required by either the deterministic or the stochastic version of blockmodelling. I would like to focus on a potentially useful extension of the latent position cluster model, one that further relaxes the cluster membership assumption of the current method.

Cluster memberships can be defined as fuzzy rather than crisp and modelled as such. Two research traditions laid the foundation for this thinking. The sociological literature has long established that groups intersect within the person (Simmel, 1955), suggesting that one person can belong to multiple clusters. The idea was revisited in early social network analysis by Breiger (1974). In mathematics, Zadeh (1965) started a long line of research on fuzzy sets, with useful applications in engineering and in statistics (see, for example, Manton *et al.* (1994)).

The current method can easily extend to the use of grade of membership (or fuzzy membership) in estimating latent clusters. This can be achieved by defining uncertainty in cluster membership as a function of an actor's fuzzy membership, or *q*_{ig}=*f*{*μ*_{A}(*i*)}, where *μ*_{A}(*i*) is the membership function of actor *i* in cluster A. Similar functions can be defined for clusters B, C, etc. Therefore, one actor may belong to multiple clusters to varying degrees. For the current examples, whereas the Sampson data may not need this extension (Fig. 3), the social network data from the National Longitudinal Study of Adolescent Health would more probably benefit from a fuzzy operation (Fig. 8).
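One concrete way to realize such grades of membership (a sketch of our own, not the proposed extension itself: *q*_{ig} is taken to be the posterior responsibility of cluster *g* for actor *i* under a spherical Gaussian mixture on the latent positions, and all names are illustrative):

```python
import numpy as np

# q[i, g]: soft membership of actor i in cluster g; each row sums to 1.
def soft_memberships(Z, mus, sigma2, lam):
    Z = np.asarray(Z, dtype=float)
    mus = np.asarray(mus, dtype=float)
    d = Z.shape[1]
    # Squared distances from each position to each cluster mean.
    sq = ((Z[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
    logw = np.log(lam) - 0.5 * sq / sigma2 - 0.5 * d * np.log(sigma2)
    logw -= logw.max(axis=1, keepdims=True)        # stabilize the softmax
    q = np.exp(logw)
    return q / q.sum(axis=1, keepdims=True)
```

An actor sitting between two cluster means then receives appreciable membership in both, which is exactly the behaviour that a crisp assignment suppresses.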

As this should be a rather natural extension, I hope to see it developed in a sequel paper and implemented in a new version of latentnet.

**Bruno Mendes and David Draper** (*University of California, Santa Cruz*)

In their Bayesian fitting method the authors use conditional posterior model probabilities, which (as usual) are based on integrated likelihood values, and (as usual) integrated likelihoods can be highly sensitive to the manner in which diffuse prior distributions on the parameters of each model are specified (and this sensitivity can persist even with large sample sizes). If the authors had used a Laplace style *O*(*n*^{−1}) approximation to the logarithm of the integrated likelihood, they would have had to face this instability directly, because terms of the form log *p*(*θ̂*_{j}|*M*_{j}) (where *θ̂*_{j} is the maximum likelihood estimator or mode of the posterior distribution *p*(*θ*_{j}|*y*,*M*_{j}) for the parameter vector *θ*_{j} specific to model *M*_{j}; here *y* are the data) would arise in the Laplace approximation and could easily vary unstably as a function of the details of the diffuse prior specification. They appear to avoid this problem by using a cruder *O*(1) approximation based on the Bayes information criterion, in which prior specification details are swept under the rug. The something-for-nothing bell is ringing in the background here: apparently one can get around a fundamental difficulty (which does not necessarily go away as the amount of data increases) with integrated likelihoods just by adopting a cruder approximation to them. Perhaps the authors can clarify.
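The instability is easy to exhibit numerically (a sketch under the assumption of a scalar parameter with a diffuse *N*(0, *τ*^{2}) prior; all values are illustrative):

```python
import numpy as np

# The prior-sensitive Laplace term log p(theta_hat | M_j) keeps falling as
# the prior is made more diffuse, even though the data are untouched.
def log_prior_at_mode(theta_hat, tau):
    return float(-0.5 * np.log(2.0 * np.pi * tau**2)
                 - theta_hat**2 / (2.0 * tau**2))

terms = [log_prior_at_mode(1.0, tau) for tau in (10.0, 100.0, 1000.0)]
# Each tenfold widening of the prior removes roughly log(10) from the
# approximate integrated likelihood.
```

Since nothing in the likelihood changes across these prior choices, any model comparison resting on this term inherits the arbitrariness of the diffuseness scale.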

**Gesine Reinert** (*University of Oxford*)

The authors are to be congratulated on their paper; it provides a novel approach linking statistics and social network analysis.

An interesting tangent is the recent statistical physics approach to networks. The most basic such construction is the *Watts–Strogatz model*, where random shortcuts are added to a fixed lattice, the end points of the shortcuts being chosen uniformly. Slightly more complicated models arise in network growth models, where a new vertex creates a (fixed or random) number of links to existing vertices, with possibly preferential attachment rules. These classes of so-called *small world networks* are claimed to provide suitable models for social networks such as scientific collaboration networks and Internet dating networks; for an overview see for example Dorogovtsev and Mendes (2003).

As customary in statistical physics, many of the small world network results are of an asymptotic nature. In contrast, the problems that were studied by the authors involve only a small number of vertices, so asymptotic regimes may not be of any direct interest. In addition, small world network models may not capture some of the important features in the data. Yet even asymptotic small world network results could potentially be relevant, not only because increasingly more large social network data sets become available, but also because such results help us to understand better the qualitative behaviour of networks. An example is the body of results on the emergence of a giant cluster (see Durrett (2006) for an introduction and Bollobas *et al.* (2006) for recent progress), which could relate to percolation-based clustering algorithms as in Sasik *et al.* (2001).

Beyond Bernoulli random graphs, for all these network models assessing model fit remains an open question. Often networks are summarized by using the clustering coefficient, the average shortest path length, the average vertex degrees or the number of occurrences of certain network motifs. Ideally (at least asymptotic) distributions for some summary statistics would be available to derive parameter estimates and to develop rigorous statistical tests for model fit. For Watts–Strogatz small world networks a few results can be found in Barbour and Reinert (2001, 2006), but much more research on these issues is needed.
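One of the summary statistics mentioned above can be sketched directly (our own illustration for an undirected 0/1 adjacency matrix; the toy graphs are invented):

```python
import numpy as np

# Global clustering coefficient: 3 x (triangles) / (connected triples).
def global_clustering(A):
    A = np.asarray(A)
    triangles = np.trace(A @ A @ A) / 6.0          # each triangle counted 6 times
    deg = A.sum(axis=1)
    triples = (deg * (deg - 1) / 2.0).sum()        # paths of length 2
    return 3.0 * triangles / triples if triples else 0.0

# In the complete graph every triple closes, so the coefficient is 1.
K4 = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)
# global_clustering(K4) -> 1.0
```

A sampling distribution for a statistic like this, under a candidate model, is exactly what would be needed to turn such summaries into formal goodness-of-fit tests.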

Although networks have recently received considerable attention not only from social scientists but also from statistical physicists, few statisticians have taken up the challenge of contributing to this field. The paper under discussion might help to reverse this trend.

**Sylvia Richardson and Alex Lewin** (*Imperial College London*)

The authors are to be congratulated on their stimulating paper that will foster the application of the same ideas in many different domains.

Our comments relate to two aspects.

- (a)
The authors hardly discuss the choice of the dimension *d* of the latent space and their examples always use *d*=2. It would be reasonable to expect a close relationship between *d* and the number of clusters, a relationship that is not discussed. For a large complex network, if the dimension chosen is too low, then possibly more clusters will seem to be necessary. Could *d* be included as a parameter in the analysis so that joint inference is made on *d* and the number of clusters?

- (b)
It is unclear to us whether the additive formulation (2) is the best way of including homophily on observed attributes. There is an interplay between the effect of covariates on the log-odds for links and the latent space which captures the effect of hidden characteristics of the social actors. A useful parallel can be drawn with ecological regression and spatial patterns of disease. Usually the ecological regression equation for relating the underlying log-relative-risk for a disease in an area *i* to area-specific covariates *X*_{i} is written as

log(*θ*_{i}) = *β*_{0}^{T}*X*_{i} + *s*_{i},

where *s*_{i} is a Markov random field that captures *residual latent spatial structure* that is not accounted for by the covariates. This assumes no interaction between the covariates and space, and when this is not reasonable an interaction with space is considered instead, i.e. *β*_{0} becomes indexed by *i* (see for example Gelfand *et al.* (2003)). For social networks, we feel that such interaction is likely and, hence, including covariates solely as a fixed effect might not be appropriate. For example, you might expect girls and boys to mix more in older years and therefore the influence of gender similarity to be different for each age cluster. Thus a more realistic model would investigate whether the homophilic effect of the covariates differs for different clusters. A useful extension of equation (2) might thus be

log odds(*y*_{i,j} = 1) = *β*_{0,*δ*_{i},*δ*_{j}}^{T}*x*_{i,j} − *β*_{1}|*z*_{i} − *z*_{j}|,

where *δ*_{i} is the allocation label in the clustering for actor *i*. We believe that such an extension would enhance the capacity of the model to account for complex network structure and we would welcome the authors’ thoughts on the interplay between covariates and the cluster structure.

**D. M. Titterington** (*University of Glasgow*)

I have two comments about this interesting paper.

If we denote the set of all group memberships by *K* and let *φ* denote all parameters, then a key factorization is

- *P*(*Y*,*Z*,*K*|*X*,*φ*) = *P*(*Y*|*Z*,*K*,*X*,*φ*) *P*(*Z*|*K*,*X*,*φ*) *P*(*K*|*X*,*φ*).  (14)

In the paper a specialized version of this is used, corresponding to

- *P*(*Y*,*Z*,*K*|*X*,*φ*) = *P*(*Y*|*Z*,*X*,*β*) *P*(*Z*|*K*,*μ*,*σ*) *P*(*K*|*λ*).  (15)

Furthermore, if there is no covariate information *X*, as is the case in both main examples, then this becomes

- *P*(*Y*,*Z*,*K*|*φ*) = *P*(*Y*|*Z*,*β*) *P*(*Z*|*K*,*μ*,*σ*) *P*(*K*|*λ*).  (16)

The Bayesian calculations in Section 3.2 essentially use the formula in equation (15), together with a prior *P*(*φ*), as a basis for estimating *P*(*Z*,*K*,*φ*|*Y*,*X*) by using Markov chain Monte Carlo sampling. In contrast, in the first stage of the two-stage method in Section 3.1, the method of Hoff *et al.* (2002) takes the first factor on the right-hand side of equation (15), namely *P*(*Y*|*Z*,*X*,*β*), where *β* is the part of *φ* corresponding to that factor, and maximizes it with respect to *Z* and *β*, with the resulting estimates being referred to as the ‘maximum likelihood’ estimates of the latent positions. My first comment is to indicate some anxiety over this, because it is well known that treating ‘missing values’, such as latent scores, as ‘parameters’ in this way can lead to problems such as biases in the estimators of the genuine parameters; see for example Little and Rubin (1983) and Marriott (1975). However, I concede that the normative maximum likelihood approach would be computationally difficult. It would involve using the EM algorithm to estimate *φ*, with complete-data likelihood given by whichever of equations (14), (15) and (16) is appropriate, and then obtaining values for the latent positions *Z* in the same spirit as the calculation of factor scores in factor analysis.

This brings me to my second point. Suppose, in contrast with equation (15), we factorize *P*(*Y*,*Z*,*K*|*φ*) as

*P*(*Y*,*Z*,*K*|*φ*) = *P*(*Y*|*Z*,*K*,*φ*) *P*(*Z*) *P*(*K*|*φ*).

If the variables in *Y* are continuous and if *P*(*Z*) corresponds to *Z*∼*N*(0,*I*), then this gives the mixture of factor analysers model for *Y*; see Ghahramani and Beal (2000) and Fokoué and Titterington (2003), and also Fokoué (2005) for a version that incorporates *X*-like covariates. It would be appropriate to describe the version with binary variables in *Y* as a mixture of latent trait models (Bartholomew, 1987) or a mixture of density networks (MacKay, 1995). I wonder whether this variation would produce interesting results in the contexts that are covered by the paper. I suspect probably not, at least so far as interpretability is concerned, but the relationship between the two types of model may be of interest.

**Stanley Wasserman** (*Indiana University, Bloomington*)

At a reception about 10 years ago, part of a memorial tribute to Cliff Clogg at Penn State University, a well-known, and very good, statistician–sociological methodologist chatted with me about social network analysis. I was surprised to find that, after I told him that I did network analysis, this well-known person expressed the view that network analysis was just a bunch of indices, with little thought given to statistical models. And, I felt the weight of his accusation. He was incorrect, of course, but it was a common misperception at that time.

There is no question that network analysis has come a long way over the past decade, spurred on to some extent by the many researchers doing statistical modelling and ordinary people who are interested in networks. Here in the States, there is even a new television show on the ABC network named ‘Six degrees’. With networks pervasive in our 21st-century popular culture, it is pleasing to know that we are learning what to do with network data. The paper under discussion here, by Professor Raftery, Professor Handcock and Dr Tantrum, is a very fine piece of mathematics, a perfect example of the growth of the discipline. It certainly advances network science, but it leaves me with a few questions.

First, what will be the fate of this clever model? Will it be ignored, as Hoff *et al.* (2002) has been? Many statistical approaches to networks (such as correspondence analysis and stochastic blockmodels (Wang and Wong, 1987)—which include the models that are described here) have been little used. Could a mere mortal, a social networker from, say, social work, fit this model? A friend of mine years ago remarked that network data are more complicated than the models that are used to study them. I think that the opposite is now true.

Second, what has happened to network *data analysis*, as opposed to statistical network modelling? Sure, we have great models and the computing ability to fit them by using appropriate and correct estimation techniques, but very little thought has gone into questions such as ‘why this model and not that one?’. How does model A compare with models B–Z on a wide range of data sets? The authors ignore this issue. It may take years to answer questions such as these. We are just now making progress on understanding the exponential family *p** (using good ideas such as those in Goodreau (2007)), but we need more data-oriented papers such as Holland and Leinhardt (1981).

Network research needs more of Cliff Clogg, a good sociologist and a superb statistician, who cared about data, and less of Bayesian formalism.

The moral of my story at the beginning of my comment is that this very person is now doing research on excellent, and sophisticated, network models. I feel partially vindicated!

**Adriano Velasque Werhli** (*Biomathematics and Statistics Scotland, Edinburgh*) **and Peter Ghazal** (*Scottish Centre for Genomic Technology and Informatics, Edinburgh*)

Although the title of the paper suggests that the method proposed is restricted to the analysis of social networks, it is interesting to investigate whether it has the scope for wider applications to biomolecular interaction networks. For this we have applied the algorithm to a genetic network that is related to the action of interferons, which play a pivotal role in modulating the innate and adaptive mammalian immune system. The network is shown in Fig. 16(c). We applied the algorithm in the same way as described in the paper, using standard diagnostics to test for convergence of the Markov chain Monte Carlo simulations. Fig. 16(a) shows the positions of the nodes in the latent space, obtained for the number of clusters with the highest marginal likelihood score (ngroups=3). It is obvious that no clear cluster formation is found, and the cluster assignment that was predicted was not biologically meaningful.

The reason for the failure of the algorithm becomes clearer when investigating the interferon gamma pathway more closely. There are various hub nodes connected to sets of peripheral nodes that are not themselves interconnected, and this violates the transitivity assumption on which the algorithm is based. To put this to an empirical test, we modified the interferon network as follows. We identified seven central regulators (i.e. hub nodes): Stat1, Irf1, Irf7, C2ta, Irf3, Irf2 and Irf4. For each regulator, we completely interconnected all the regulated genes with bidirectional edges and, in addition, introduced bidirectional edges between the regulators and regulatees. This is to ensure the formation of clique structures that satisfy the transitivity condition. We then applied the method proposed to the modified network. The resulting positions of the nodes in the latent space are shown in Fig. 16(b). Fig. 16(c) shows the original network, where the shading of the nodes indicates the cluster membership (again, we used the number of clusters that maximizes the marginal likelihood: ngroups = 4). The cluster formations are now much more distinct and are clearly related to the regulators and their regulated genes. (There is no perfect agreement owing to interconnections between the cliques and violations of the transitivity condition in other parts of the network).
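The modification described above can be sketched on an adjacency matrix (an illustrative reimplementation of ours, not the authors' pipeline; the star graph is a toy stand-in for a hub and its regulatees):

```python
import numpy as np

# For a given hub, make its edges bidirectional and interconnect all the
# genes it regulates, turning the hub-and-spokes motif into a clique.
def close_into_clique(A, hub):
    A = np.array(A)
    regulated = np.flatnonzero(A[hub])             # nodes the hub points to
    for i in regulated:
        A[hub, i] = A[i, hub] = 1                  # bidirectional hub edges
        for j in regulated:
            if i != j:
                A[i, j] = A[j, i] = 1              # interconnect regulatees
    return A

# A star (hub 0 regulating genes 1-3) becomes a 4-clique, restoring the
# transitivity that the latent position cluster model expects.
star = np.zeros((4, 4), dtype=int)
star[0, 1:] = 1
K = close_into_clique(star, 0)
```

Applying this to each hub in turn reproduces, in miniature, the clique-forming transformation that made the cluster structure recoverable.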

This analysis indicates that the algorithm proposed is not generalizable to molecular biological interaction networks that inherently violate the transitivity condition.

The **authors** replied later, in writing, as follows.

We thank all the discussants for their stimulating comments. The large number and wide range of discussions suggest that the statistical analysis of social networks is a developing area that is poised for rapid growth. Many potential applications were mentioned, including to epidemics (Greenwood), post-genomic data (Husmeier and Glasbey), biomolecular interaction networks (Werhli and Ghazal) and rank data (Gormley and Murphy).

We appreciate the many positive comments about the latent position cluster model. In particular, we would underline Snijders's comment that latent structure models allow data that are missing at random to be handled almost trivially. This point was not made in our paper, and it is important because often much of the data about a network of interest is based on network sampling or subject to out-of-design missingness.

##### Social network characteristics

Our model was designed to take account of homophily on observed attributes, transitivity and clustering, but it did not incorporate other important features. One of these is what Snijders calls prominence and is also referred to as activity, sociability or popularity, namely the fact that some actors tend to send and/or receive more links than others, sometimes by a large margin. Greenwood emphasizes the importance of this for applications to infectious disease epidemics. Note that in our examples this was not an important feature of the data, as can be seen from Figs 1 and 4, for example. This was in part because the data collection method discouraged it; for example, the school students in the adolescent health data set were invited to name no more than 10 friends, and most did name close to that number, so the tendency to send links did not vary greatly between students.

It seems most natural to allow for this by adding random sender and receiver effects to equation (2) of our paper; this would be a small technical modification to the model. This is similar to the specification of random effects in the *p*_{2}-model by van Duijn *et al.* (2004), as pointed out by Snijders and van Duijn. Hoff suggests an eigenvalue decomposition model (Hoff, 2007) as an alternative. This seems less easily interpretable than a random-effects model but allows the separation of homophily and structural equivalence. In our experience, it is also computationally efficient, which is important for scaling the methods to larger networks.

Another important feature that our model does not include is what Snijders calls hierarchy and Butts calls asymmetry, namely the tendency in a given dyad for one member to send links and the other to receive them. As Snijders points out, this can also be represented at least partly by sender and receiver random effects. Butts suggests a simple and elegant generalization of this idea, in which each actor has a latent, possibly vector-valued, ‘vertex potential’. This is an important contribution. Overall, we agree with Snijders's view that activity and popularity dimensions should be included by default in latent space modelling of social networks, and this is easy to do in our modelling framework.
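The small technical modification discussed above can be sketched as follows (our hedged illustration: the normal random-effect distributions, their variances and the simple additive form are assumptions, not the authors' specification):

```python
import numpy as np

# Sender (a_i) and receiver (b_j) random effects added to the log odds of a
# tie, alongside the latent distance term.
rng = np.random.default_rng(2)
n = 10
a = rng.normal(0.0, 1.0, size=n)   # activity / sociability of sender i
b = rng.normal(0.0, 1.0, size=n)   # popularity of receiver j

def log_odds(i, j, dist_ij, beta1=1.0):
    # Actors with large a_i send more ties; those with large b_j receive more.
    return a[i] + b[j] - beta1 * dist_ij
```

Because the effects enter additively, dyadic reciprocity and latent clustering are left untouched while marginal out-degrees and in-degrees are free to vary across actors.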

Breiger, and Gruenberg and Francis point out that some social networks exhibit negative affect, as a result of which people with opposite attributes tend to attract. The clearest example is that of heterosexual sex networks. When the relevant attributes are observed, as in the sex network case, this could be dealt with naturally in our model by the *β*_{0}*x*_{i,j} term in equation (2), where *x*_{i,j} represents dissimilarity (e.g. being of the opposite sex) and *β*_{0} is positive. It is indeed a challenge to adapt the model to the situation where opposites attract and the relevant attributes are unobserved, as Gruenberg and Francis remark.

Lawrance and Krackhardt ask how sensitive our results are to the conditional independence assumption in equation (1). We think that the answer is ‘not very’. In our model, links are conditionally independent given the unobserved latent variables *z*_{i}; thus unconditionally they can be highly dependent. Indeed Hoff, citing Aldous (1985), points out that all social network data of this type can be represented as conditionally independent given some actor-specific latent variables, which provides some theoretical basis for thinking that the conditional independence assumption is not restrictive. As noted by Hunter and Snijders, this assumption can be tested by incorporating a more general exponential random-graph model (ERGM).

##### Model-based clustering specification

Our model specifies the distribution of latent positions within a cluster to be multivariate normal with a spherical covariance matrix. Robinson, Forster, Hennig and Lawson ask whether we could relax this assumption to allow a more general, non-spherical covariance matrix. We did experiment extensively with such a model and found that the results were often unstable and difficult to interpret. This seems to be because the amount of information in the data that is used to define, say, a cluster of seven monks is actually quite small, consisting of a small number of binary observations, and is not enough to specify a general covariance matrix with adequate precision. With the simpler model that we used, we did obtain stable and interpretable results.
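For concreteness, the spherical-normal mixture for the latent positions can be sketched as follows. The weights, means and variances here are hypothetical illustrative values, not estimates from the paper; each component has covariance *σ*_{g}^{2}*I*, which is what keeps the number of free parameters small.

```python
import numpy as np

def spherical_mixture_density(z, weights, means, variances):
    """Density of latent positions under a finite mixture of spherical
    (isotropic) multivariate normals, one component per cluster:
    sum_g  w_g * N(z; mu_g, sigma_g^2 * I)."""
    z = np.atleast_2d(z)
    d = z.shape[1]
    dens = np.zeros(len(z))
    for w, mu, s2 in zip(weights, means, variances):
        sq = np.sum((z - mu) ** 2, axis=1)          # squared distance to centre
        dens += w * np.exp(-sq / (2.0 * s2)) / (2.0 * np.pi * s2) ** (d / 2.0)
    return dens

# Two well-separated clusters in two dimensions (hypothetical values)
weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
variances = [1.0, 0.5]
vals = spherical_mixture_density(np.array([[0.0, 0.0], [4.0, 4.0]]),
                                 weights, means, variances)
```

A non-spherical version would replace the scalar variance `s2` by a full covariance matrix per component, which is exactly the generalization that we found the data too sparse to support.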

Nevertheless, it is possible that in some cases a non-spherical model could be useful. For example, Bearman *et al.* (2004) reported ‘chaining’ effects in romantic networks of adolescents, and it is possible that such clusters could be represented by long thin mixture model components, with covariance matrices that have a high ratio of largest to smallest eigenvalues.

Robinson suggests the use of non-Gaussian components in the mixture model, and Atkinson and Longford point out that a mixture of normal distributions can represent a non-Gaussian shape rather than clustering. Our experience, however, is that network data do not provide enough information to support the use of non-Gaussian components or to lead to the use of more than one Gaussian component for a single cluster. These issues can be important for clustering observed data, but they seem much less relevant when clustering latent positions that are not very precisely determined by the data.

Hennig suggests the addition of a low intensity uniform noise component to the mixture model (3) to represent isolated actors with few or no links. This is an excellent idea, as isolated actors are common in social networks and are difficult to model. The use of a low intensity uniform noise component to represent outliers in model-based clustering was proposed by Banfield and Raftery (1993), and Hennig (2004) has shown that it leads to methods with good classical robustness properties when applied to observed data.

##### Choice of distance

We used the Euclidean distance between latent positions to specify our model. This has the advantage that the resulting positions can be represented in Euclidean space, which is useful for visualization and interpretation. However, as Snijders points out, other distances, such as the ultrametric, could also be used.

Breiger and Gelman point out that in practice people belong to multiple networks, and that our model does not account for this. One way to do so could be to change the dissimilarity measure in the latent space model. For example, we could replace the Euclidean distance |*z*_{i}−*z*_{j}| in equation (2) by the co-ordinatewise minimum dissimilarity, min_{k}|*z*_{ik}−*z*_{jk}|, where the *z*_{ik} are the co-ordinates of *z*_{i}. This could be interpreted as follows. If each co-ordinate corresponds to a different component network (family, friends, work, neighbourhood, etc.), then proximity on any one of them will be enough to make the chance of a link high. For example, if Bob works in the same office as Carl but lives far from him, they are almost as likely to form a link as if they lived closer. Each network could be specified by more than one co-ordinate. This should not make the estimation problem more complex.
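The contrast between the two dissimilarities can be sketched as follows; the names and co-ordinates are a toy illustration of the Bob-and-Carl example, not data from the paper.

```python
import numpy as np

def min_coord_dissimilarity(z):
    """Co-ordinatewise minimum dissimilarity: proximity on any one
    co-ordinate (component network) is enough to make it small."""
    return np.abs(z[:, None, :] - z[None, :, :]).min(axis=-1)

def euclidean_distance(z):
    """Ordinary Euclidean distance between latent positions."""
    return np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)

# Bob and Carl: close at work (co-ordinate 0), far apart at home (co-ordinate 1)
z = np.array([[0.1, 0.0],    # Bob
              [0.2, 5.0]])   # Carl
print(min_coord_dissimilarity(z)[0, 1])  # 0.1 -- the 'same office' co-ordinate dominates
print(euclidean_distance(z)[0, 1])       # about 5.0 -- the home distance dominates
```

Under the minimum dissimilarity Bob and Carl remain likely to form a link, whereas the Euclidean distance would make the link improbable.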

##### Extensions

Besag, Breiger and Gelman point out that networks evolve dynamically and that extending the model to incorporate this would be useful. Greenwood points out that this is particularly important for modelling infectious disease epidemics. We agree, and a start has been made on this by Westveld and Hoff (2005).

Another important extension is to the case where links are not binary, but quantitative, e.g. a measure of how friendly Bob and Carl are rather than just whether or not they are friends. Such a measure could be continuous-valued, or a count, such as the number of times that they meet per month. It could also be categorical, for example, if the link could be negative (e.g. dislike) or positive, as pointed out by Gruenberg and Francis. Such an extension can be readily accommodated within our framework, by replacing the logistic regression of our equation (2) by another response model. This could be a generalized linear model, as proposed by Hoff (2003). See also Oh and Raftery (2003).
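For example, a count-valued network could be handled by swapping the logistic response for a Poisson log-linear one. The following is our own illustrative sketch, under the assumption that the latent distance enters the log mean directly; it is not the authors' specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_counts(z, beta0):
    """Poisson response in place of the logistic one: the latent distance
    enters a log-linear model for tie counts,
    E[y_ij] = exp(beta0 - |z_i - z_j|)."""
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    mu = np.exp(beta0 - d)
    np.fill_diagonal(mu, 0.0)   # no self-ties
    return mu

z = rng.normal(size=(10, 2))
mu = expected_counts(z, beta0=1.5)   # e.g. expected meetings per month
y = rng.poisson(mu)                  # simulated count-valued network
```

A categorical (negative/neutral/positive) link would be handled analogously with a multinomial or ordinal response in place of the Poisson.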

Snijders and Hunter suggest combining the latent position cluster model with the ERGM class. This would allow the direct representation of structural signatures hypothesized under social theories (e.g. triadic balance). The interpretation of the latent component of the model then changes: it represents residual social structure. Steps in this direction have been made by Handcock *et al.* (2003b). One way in which this combination would be immediately useful is suggested by van Duijn's discussion. Our model does not fully model reciprocity. This could be done by replacing our equation (2) by a bivariate binary response model for *y*_{i,j} and *y*_{j,i} jointly, where the model is a *p*_{2}-model as specified by van Duijn *et al.* (2004), but specified conditionally on the distance |*z*_{i}−*z*_{j}|.

Robinson, and Richardson and Lewin raise the issue of how the dependence on observed attributes and the clustering should be jointly specified. This is an important and unresolved issue that we did not investigate in our paper beyond writing down equation (2). Social network researchers would refer to Richardson and Lewin's extension as differential homophily by cluster, and it is an interesting possibility. More basically, the question of whether dependence on observed attributes is best represented by our equation (2) at all is an open question. An alternative would be to allow the observed attributes to influence group membership probability, leading to a mixture-of-experts model, as suggested by Gormley and Murphy.

Titterington suggests a different factorization of the likelihood for our model, which suggests possible alternative models. He suspects that this would not produce interesting results in our context, and we must agree, but his discussion still places the modelling in a broader framework that could be productive.

##### Alternative models

Perhaps the most influential alternative to statistical estimation of probability models for social networks consists of models from statistical physics, such as small world networks and models that are based solely on degree distributions, as noted by Reinert; see Newman (2003). We must agree with her that these models may not capture important features of the data, but that some of the results from this literature may be helpful. We also agree that the physics literature lags in model fitting and assessment, and we note that statisticians have started to contribute here (Handcock and Jones, 2004; Handcock and Morris, 2007).

The generalized blockmodels that were discussed by Breiger and Doreian, Batagelj and Ferligoj are based on deterministic algorithms, and as such do not provide a statistical basis for estimation, inference and choice of the number of groups. Nevertheless, these results may give insight into network structure that could be useful in statistical modelling, and so we welcome the suggestion to couple the two approaches. We note that our model is not simply a stochastic blockmodel: the actors are not structurally equivalent given cluster membership alone, but they are given the latent positions. Thus members of the same cluster are heterogeneous (e.g. Victor and Romuald in the Loyal Opposition).

Airoldi, Blei and Fienberg, and Liao discuss the possible use of the grade of membership model, either in place of or in combination with the mixture model that we use here. This would allow actors to belong to several groups with different ‘grades of membership’. As Liao notes, the idea that individuals are defined via their group memberships goes back a century to the work of Simmel. These models would indeed be appropriate when the objective is to represent identity as a function of latent group memberships.

Sweeting suggested using a Dirichlet process mixture or similar model for the model-based clustering component of our model. This seems well worth investigating. Our experience with the Dirichlet process mixture model suggests that care is needed, however; see, for example, Petrone and Raftery (1997). For example, conditionally on a given number of groups, the Dirichlet process mixture prior tends strongly to favour very unbalanced groups, which may not be appropriate.

##### Model fit

We did not report much assessment of the fit of the model in absolute terms in our paper, and it is indeed important to do this. Given our estimation method, the most natural framework for this is that of posterior predictive checking (Gelman *et al.*, 1996), as alluded to by Husmeier and Glasbey. The statistics that are used for this could be descriptive network measures capturing important aspects that we want to reproduce; Reinert lists some common statistics, and Snijders suggests some new measures that could be used. As noted by Goodreau, a general framework for goodness of fit has been developed by Hunter *et al.* (2007) and Goodreau (2006), and posterior predictive checks are implemented in the software of Handcock *et al.* (2004).
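Schematically, a posterior predictive check of this kind might look as follows, assuming a sample of tie-probability matrices `p_draws` is available from the MCMC output. The statistic shown is the triangle count, and the draws below are made-up placeholders, not output from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

def triangle_count(y):
    """Number of triangles in an undirected binary adjacency matrix."""
    return np.trace(y @ y @ y) / 6.0

def posterior_predictive_pvalue(y_obs, p_draws, stat=triangle_count):
    """For each posterior draw of the tie-probability matrix, simulate a
    network and compute the statistic; compare with the observed value."""
    sims = []
    for p in p_draws:
        y = (rng.random(p.shape) < p).astype(int)
        y = np.triu(y, 1)
        y = y + y.T                      # symmetrize, no self-ties
        sims.append(stat(y))
    sims = np.array(sims)
    return np.mean(sims >= stat(y_obs)), sims

# Toy illustration with placeholder draws
n = 15
p_draws = [np.full((n, n), 0.3) for _ in range(200)]
y_obs = (rng.random((n, n)) < 0.3).astype(int)
y_obs = np.triu(y_obs, 1)
y_obs = y_obs + y_obs.T
pval, sims = posterior_predictive_pvalue(y_obs, p_draws)
```

An extreme posterior predictive *p*-value (near 0 or 1) for a given statistic would flag an aspect of the data that the model fails to reproduce.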

Hennig suggested using a parametric bootstrap, and this could be viewed as an approximation to posterior predictive checking. Snijders noted that this would be time consuming, however, and one advantage of posterior predictive checking here is that it could follow from our Bayesian estimation method with modest computational effort.

Reinert suggests using asymptotic theory to obtain the distributions of test statistics, whereas Goodreau suggests adding a latent space model as a residual to see whether there is any remaining structure. These suggestions seem worth pursuing. Goodreau's suggestion should be particularly helpful in decomposing the variation due to observed covariates, structural signatures (via ERGM terms) and residual social structure.

Draper suggests that we use ground truth to assess the model, in particular the clustering. What we did in the monks example is similar to what he suggests. The ‘known’ clustering that we used there as ground truth was based on a large amount of information, including ethnographic study by the researcher S. F. Sampson, who lived in the monastery for a year and observed the monks’ interactions closely. We know of no formal attempts at outcome validation using personal ‘truth’ but note that self-identification may not be definitive. Lawrance's suggestion that we validate the model by applying it to academic departments could give an even more compelling form of ground truth!

##### Model choice

We used Bayes factors, approximated by a version of the Bayes information criterion (BIC), to compare models. Krackhardt pointed out that there are dozens of competing methods for finding clusters in social network data that can give wildly differing results, but until now there has been no clear way to choose the best method. We agree strongly with him that using Bayes factors is a ‘major step forwards out of this intellectual morass’.

Forster, and Kuha and Skrondal suggest using an integrated likelihood from the Markov chain Monte Carlo (MCMC) output rather than our BIC approximation. Although it would seem that this should be easy, it has turned out to be surprisingly difficult to find a generic method for doing this. Raftery *et al.* (2007) reviewed this literature and proposed a criterion called BICM based on the MCMC output. We are glad that van Duijn could report favourable results with this criterion; Gormley and Murphy (2007) also reported good results with BICM in a different latent space model.

Husmeier and Glasbey say that we should have integrated out the latent positions when computing the Bayes factors, but Snijders found that what we did, i.e. keeping the latent positions fixed in the Bayes factor calculations, was reasonable. This is clearly debatable, but our argument for doing what we did seems to have been acceptable to most discussants. Husmeier and Glasbey also assert incorrectly that the derivation of the BIC assumes that the posterior distribution is multivariate normal with an isotropic diagonal covariance matrix, but in fact the result that the BIC provides an *O*(1) approximation is valid in much greater generality (Kass and Wasserman, 1995).
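For reference, the result in question is the standard Schwarz approximation, where $\hat{L}$ denotes the maximized likelihood, $d$ the number of free parameters and $n$ the sample size:

```latex
% Schwarz/BIC approximation to twice the log integrated likelihood.
% The O(1) error bound holds under broad regularity conditions and
% does not require a normal posterior (Kass and Wasserman, 1995).
2\log p(y \mid M) \;=\; 2\log \hat{L} \;-\; d\log n \;+\; O(1).
```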

Kuha and Skrondal, and Mendes and Draper suggest using the Laplace method to integrate out the parameters, rather than the BIC (while keeping the latent positions fixed). This seems like a good idea, especially since the BIC is derived from the Laplace method. However, Mendes and Draper point out that the sensitivity of model choice to the prior would then be an issue. Kuha and Skrondal reported some assessment of that sensitivity and found that with our priors the resulting integrated likelihood tends to indicate less evidence than the BIC for more complex models. This indicates that our prior is in some sense more spread out than the unit information prior that underlies the BIC.

This raises an interesting point. Typically, Bayesian estimation is not much affected by making the prior flatter, but Bayesian model choice can be. Our priors were designed for estimation, so we made sure that they were at least as spread out as reasonable prior information, but we did not devote much effort to making sure that they were not too spread out, which is also necessary in the model choice situation. If the priors were to be used for computing Bayes factors more exactly, we might need to revisit them to ensure that they are not too spread out. Overall, we feel that our BIC approximation provides a reasonably simple and robust method for model comparison in the present context.

##### Choice of dimension

We used two dimensions for the latent space throughout, but it would be possible, and perhaps desirable, to make the choice of dimension data dependent. Oh and Raftery (2003) showed how to do this by using Bayes factors for a similar model, based on Oh and Raftery (2001). They found, perhaps surprisingly, that there was little interaction between the dimension of the latent space and the number of clusters. If this also holds in the present context, there would be little need for the simultaneous choice of dimension and number of clusters that was suggested by Richardson and Lewin. Note that a direct use of the BIC, as done by Faust and Petrescu-Prahova, is not correct for the choice of dimension here.

Raftery *et al.* (2007) applied their methods for estimating integrated likelihoods from MCMC output to precisely this problem and found that for the monks network the choice of two dimensions was favoured. This could provide a simpler and more generic solution to the problem. For the adolescent health data, a third dimension does not lead to clear separation of the higher grades. The choice of one dimension does lead to groups approximately ordered in grade, although with substantially less definition than the two-dimensional version.

##### Estimation

Leslie suggested improving the efficiency of our MCMC algorithm by updating the group memberships and the latent positions simultaneously. This sounds like a good idea, although it is an empirical question whether the gain in efficiency is worth the resulting greater complexity of the algorithm. Kent points out that it is the shape of the configuration of latent positions that is important. This is correct, and we took account of that by the Procrustes step in our algorithm. However, we would welcome further insights from shape analysis. Snijders recommended assessing how well the data determine the latent positions and suggested sensitivity analysis for this. In fact, the posterior distribution of the latent positions (after the Procrustes step) gives an assessment of this that comes right out of our method, although we did not have space to show it in the paper. Sophisticated plotting of the posterior is implemented in the package, and example code is given to produce the plots for the monks data.
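The Procrustes step can be sketched as follows: each sampled configuration is centred and rotated (or reflected) to best match a reference configuration, which is all that is identified, since the likelihood depends on the positions only through the distances |*z*_{i}−*z*_{j}|. This is a generic sketch of the idea, not the package's implementation.

```python
import numpy as np

def procrustes_align(z, z_ref):
    """Translate and rotate/reflect configuration z to best match z_ref
    in least squares (classical orthogonal Procrustes)."""
    zc = z - z.mean(axis=0)
    rc = z_ref - z_ref.mean(axis=0)
    u, _, vt = np.linalg.svd(zc.T @ rc)   # solution R = U V^T
    return zc @ (u @ vt) + z_ref.mean(axis=0)

# A draw that is a rotated, shifted copy of the reference aligns exactly
rng = np.random.default_rng(0)
z_ref = rng.normal(size=(18, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
z_draw = z_ref @ rot + np.array([3.0, -1.0])
z_aligned = procrustes_align(z_draw, z_ref)
```

Applying this alignment to every MCMC draw makes the posterior distribution of each actor's position directly interpretable and plottable.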

Besag, and Blei and Fienberg asked about the important issue of scalability of the algorithm to larger networks. We have successfully applied the methods to networks with up to 3000 nodes. If necessary, for very large networks, it may be possible to approximate the algorithm without compromising its essential features by case–control sampling of ties in the computation of the likelihood or something like the ICM algorithm (Besag, 1986).
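The case–control idea can be sketched as follows: every observed tie (‘case’) contributes its exact log-likelihood term, while the far more numerous non-ties (‘controls’) are subsampled and upweighted by the inverse sampling fraction. This is our own illustrative sketch, not an implemented feature of the software.

```python
import numpy as np

rng = np.random.default_rng(0)

def case_control_loglik(y, eta, n_controls=5):
    """Approximate Bernoulli log-likelihood of a sparse network:
    all ties are kept; non-ties are subsampled (n_controls per tie)
    and reweighted by the inverse sampling fraction."""
    i, j = np.where(y == 1)                           # the 'cases'
    ll = np.sum(eta[i, j] - np.log1p(np.exp(eta[i, j])))
    zi, zj = np.where((y == 0) & ~np.eye(y.shape[0], dtype=bool))
    m = len(zi)                                       # all non-tie dyads
    k = min(n_controls * len(i), m)                   # number of 'controls'
    pick = rng.choice(m, size=k, replace=False)
    w = m / k                                         # inverse sampling fraction
    ll += w * np.sum(-np.log1p(np.exp(eta[zi[pick], zj[pick]])))
    return ll

# Toy network simulated from a latent distance model
n = 30
z = rng.normal(size=(n, 2))
eta = 1.0 - np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
y = (rng.random((n, n)) < 1.0 / (1.0 + np.exp(-eta))).astype(int)
np.fill_diagonal(y, 0)
ll_approx = case_control_loglik(y, eta, n_controls=5)
```

When the control sample exhausts the non-ties, the approximation reduces to the exact log-likelihood, so the scheme trades a controllable amount of Monte Carlo error for a large computational saving on sparse networks.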

Incidentally, Besag asked why we showed results for only one school in the adolescent health data. This was convenient for presentation, but we have applied the methods to all the schools (Hunter *et al.*, 2006; Goodreau, 2007). We found that the method provides insight into both clusters and segregation. However, for medium-to-large schools (above 500 students) the visualization methods need to be more sophisticated to extract the information (e.g. zooming and slicing).

##### General

Wasserman's frustration in understanding recent advances in statistical network modelling is understandable. Making complex models accessible to practitioners is important. We believe that providing high quality software is becoming an essential part of publication, not least because it allows others to evaluate and critique the models proposed (Handcock *et al.*, 2003a, 2004; Boer *et al.*, 2003). Using this software, social network practitioners, including those from social work, routinely fit these and ERGMs in a class that is taught by one of us (Handcock)! This is part of the reason that the model of Hoff *et al*. (2002) is much used and extended (as the discussants testify).