Curved exponential family models for social networks☆
Introduction
For a fixed set of n actors, or nodes, and a network on those nodes, assume that Y denotes the adjacency matrix for the network; that is,
$$Y_{ij} = \begin{cases} 1 & \text{if there is an edge from } i \text{ to } j, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
In some social network applications, the goal is to produce a probabilistic model for Y based on an observed network dataset. It is the goal of this article to explain a particular class of models, called curved exponential family models, for achieving this end. We assume here that the reader is at least somewhat conversant in certain basic techniques of statistical modelling such as logistic regression, though not necessarily familiar with the intricacies of statistical modelling of networks.
Implicit in definition (1) is the fact that we assume no valued or multiple edges are allowed; furthermore, we disallow self-edges, so $Y_{ii} = 0$ for all i. Finally, we will treat only undirected networks in this article, which implies that $Y_{ij} = Y_{ji}$, though we make this choice only to simplify some of the arguments; there is little difficulty in extending all of the results here to the case of directed networks.
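The three restrictions just described (binary entries, no self-edges, symmetry) can be checked mechanically. The following is a minimal sketch in Python; the function name `is_valid_undirected_adjacency` and the list-of-lists representation of Y are illustrative assumptions, not part of the article.

```python
def is_valid_undirected_adjacency(y):
    """Return True if y is a square 0/1 matrix with a zero diagonal
    and y[i][j] == y[j][i] for every pair (i, j)."""
    n = len(y)
    if any(len(row) != n for row in y):
        return False  # not square
    for i in range(n):
        if y[i][i] != 0:
            return False  # self-edges are disallowed
        for j in range(n):
            if y[i][j] not in (0, 1):
                return False  # no valued or multiple edges
            if y[i][j] != y[j][i]:
                return False  # undirected networks are symmetric
    return True


# A triangle on three nodes satisfies all three restrictions.
triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
print(is_valid_undirected_adjacency(triangle))
```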
Beyond the network information contained in Y, there are often additional data, such as a set of measured characteristics for each node in the network. For instance, when the nodes are people, we may know each person’s age and sex. Throughout this article, we let X denote the additional data.
We assume throughout this article that the probability of observing a particular network is a function of statistics that may depend on the network itself as well as covariates measured on the nodes. For the particular class of models known as exponential random graph models (ERGMs), the relationship between a particular graph y and its probability of occurrence conditional on the additional data X is generally expressed as
$$P_\theta(Y = y \mid X) = \frac{\exp\{\theta^{\top} g(y, X)\}}{\kappa(\theta)}, \tag{2}$$
where $g(y, X)$ is a user-defined p-vector of statistics and $\theta \in \mathbb{R}^p$ denotes the statistical parameter governing the probabilistic formation of the network. The denominator, $\kappa(\theta)$, is a normalizing constant that ensures that the sum of (2) over all possible y equals 1.
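For very small n, model (2) can be evaluated by brute force, which makes the role of the normalizing constant concrete. The sketch below is illustrative only: it assumes no covariates X, picks edge and triangle counts as the statistics, and uses hypothetical helper names (`all_undirected_graphs`, `ergm_probabilities`). Enumeration is feasible only for a handful of nodes, since the number of graphs doubles with every additional dyad.

```python
import itertools
import math


def all_undirected_graphs(n):
    """Yield every undirected graph on n nodes as a frozenset of edges (i, j), i < j."""
    dyads = list(itertools.combinations(range(n), 2))
    for bits in itertools.product((0, 1), repeat=len(dyads)):
        yield frozenset(d for d, b in zip(dyads, bits) if b)


def edge_count(edges):
    return len(edges)


def triangle_count(edges, n):
    # combinations() yields i < j < k, so each pair is already in sorted order.
    return sum(
        1
        for i, j, k in itertools.combinations(range(n), 3)
        if (i, j) in edges and (i, k) in edges and (j, k) in edges
    )


def ergm_probabilities(n, theta):
    """Map each graph y to exp(theta . g(y)) / kappa(theta),
    with the assumed statistics g(y) = (edges, triangles)."""
    weights = {}
    for edges in all_undirected_graphs(n):
        g = (edge_count(edges), triangle_count(edges, n))
        weights[edges] = math.exp(sum(t * s for t, s in zip(theta, g)))
    kappa = sum(weights.values())  # normalizing constant
    return {edges: w / kappa for edges, w in weights.items()}


probs = ergm_probabilities(4, theta=(-1.0, 0.5))
print(len(probs))           # 64 graphs, one per subset of the 6 dyads
print(sum(probs.values()))  # equals 1 up to floating-point rounding
```

With both parameters set to zero, every graph receives the same probability, which is a quick sanity check on the normalization.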
ERGMs are sometimes known in the social networks literature as p-star models (Wasserman and Pattison, 1996). We use “ERGM” instead of “p-star” here due to the vast statistical literature covering exponential family models (e.g., Barndorff-Nielsen, 1978, Brown, 1986). Nevertheless, we consider “p-star” to be synonymous with “ERGM” in this article, with one caveat: Wasserman and Pattison (1996) used a method of parameter estimation, maximum pseudo-likelihood estimation, that has come to be closely associated with the p-star models themselves. However, in this article we separate the name of the models (ERGMs or p-star models) from the method of estimating their parameters. In particular, we do not discuss maximum pseudo-likelihood estimation here, focusing instead on the better-understood method of maximum likelihood estimation. ERGMs are discussed in detail by Robins et al. (2007a).
The remainder of this article concerns generalizations of (2) known as curved exponential family models. Section 2 discusses these models in general terms, while Sections 3 and 4 focus on specific examples, namely models involving the alternating k-star, alternating k-triangle, and alternating k-twopath statistics developed by Snijders et al. (in press). These statistics have recently shown great promise in producing parsimonious models that fit certain social network datasets well; they are discussed further by Robins et al. (2007b). Much of Sections 3 and 4 is devoted to a reformulation of these statistics in terms of the degree statistics and the recently developed shared partner statistics, as well as an attempt to reveal how this reformulation aids in interpreting these statistics. Finally, Section 5 demonstrates the use of these models on real data. The analysis, which recreates and extends some earlier work using the same dataset, is carried out using the statnet package for R, available for use at csde.washington.edu/statnet. The computer code used in Section 5 may be found in Appendix A.
Curved exponential families
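In model (2), the maximum likelihood estimator (MLE) of the parameter vector $\theta$ is, by definition, the vector that maximizes $P_\theta(Y = y_{\mathrm{obs}} \mid X)$ as a function of $\theta$, where $y_{\mathrm{obs}}$ is the observed network. In other words, if we let $\hat\theta$ denote the MLE and we assume that $\theta$ is a vector contained in p-dimensional space (denoted $\mathbb{R}^p$), then we may write
$$\hat\theta = \arg\max_{\theta \in \mathbb{R}^p} P_\theta(Y = y_{\mathrm{obs}} \mid X).$$
It is worth noting here that it is extremely difficult to find $\hat\theta$ except in the case of a simplistic model (e.g., one in which all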
Rewriting alternating k-stars
The first curved exponential family model we will consider involves the k-star statistics $S_1(y), \ldots, S_{n-1}(y)$, where $S_k(y)$ denotes the number of k-stars in the graph y. A k-star is a set of k distinct edges that all share an endpoint. In particular, a 1-star is simply an edge. Note that some authors use a separate symbol for the number of edges in the graph y; in this article it is simply $S_1(y)$.
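Because a k-star is a choice of k edges meeting at a common node, the k-star counts can be computed from the node degrees alone: for k ≥ 2, a node of degree d is the centre of "d choose k" k-stars, while for k = 1 each edge is counted once rather than once per endpoint. The sketch below illustrates this; the function name `k_star_count` and the list-of-lists adjacency matrix are illustrative assumptions.

```python
from math import comb


def k_star_count(y, k):
    """Count k-stars in the undirected graph with 0/1 adjacency matrix y."""
    degrees = [sum(row) for row in y]
    if k == 1:
        # A 1-star is an edge; each edge contributes to two degrees.
        return sum(degrees) // 2
    # For k >= 2, every k-subset of the edges at a node forms one k-star.
    return sum(comb(d, k) for d in degrees)


# A star with centre 0 and three leaves.
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
print([k_star_count(star, k) for k in (1, 2, 3)])  # [3, 3, 1]
```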
When each k-star statistic has its own coefficient in an ERGM, the resulting model is
Shared partner statistics
Section 3 defines the alternating k-star statistic and shows that it may be rewritten as in Eq. (12) in terms of the degree statistics. In an analogous way, we now define the alternating k-triangle and alternating k-twopath statistics of Snijders et al. (in press) and show that they may be rewritten in terms of the edgewise and dyadic shared partner statistics, respectively, which we also define.
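As a concrete illustration of shared partner statistics, the sketch below tabulates them for a small graph. It assumes the usual definitions, which are not spelled out in this excerpt: two nodes i and j have a shared partner h when both are tied to h; the edgewise statistic for k counts edges whose endpoints have exactly k shared partners, and the dyadic statistic for k counts all pairs of nodes, tied or not, with exactly k shared partners. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def shared_partner_distributions(y):
    """Return (edgewise, dyadic) Counters mapping k to the number of
    edges / dyads whose two nodes have exactly k shared partners."""
    n = len(y)
    edgewise = Counter()
    dyadic = Counter()
    for i, j in combinations(range(n), 2):
        shared = sum(
            1 for h in range(n)
            if h != i and h != j and y[i][h] and y[j][h]
        )
        dyadic[shared] += 1
        if y[i][j]:
            edgewise[shared] += 1
    return edgewise, dyadic


# A triangle on nodes 0, 1, 2 with a pendant node 3 attached to node 2.
y = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
edgewise, dyadic = shared_partner_distributions(y)
print(dict(edgewise))  # {1: 3, 0: 1}
print(dict(dyadic))    # {1: 5, 0: 1}
```

Each triangle contributes one shared partner to each of its three edges, so the sum of k times the edgewise count over all k equals three times the number of triangles (here 3 = 3 × 1).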
The k-triangle and the k-twopath are concepts that generalize the ideas of triangle and 2-star,
Lazega’s Lawyer dataset
Lazega (Lazega and Pattison, 1999; Lazega, 2001) collected and analyzed data on working relations among 36 partners in a New England law firm. Here, we deal with only a subset of the data: An (undirected) edge will be said to exist between two partners if and only if each indicates a collaboration with the other. These data are analyzed by both Snijders et al. (in press) and Hunter and Handcock (2006). Here, we employ models similar to those used in both of these articles in order to compare and
Discussion
The class of curved exponential family models is a major generalization of the ERGM class. Though we have focused here only on very particular models arising from the work of Snijders et al. (in press), we have shown how curved exponential family models can achieve parsimonious descriptions of data by reducing a large number of parameters to only a few (e.g., in the case of the GWD term, reducing from $n-1$ parameters to only two). These reductions also allow much better behavior of maximum
References
Advances in exponential random graph (p*) models applied to a large social network. Social Networks (2007).
Lazega and Pattison. Multiplexity, generalized exchange and cooperation in organizations: a case study. Social Networks (1999).
Robins et al. An introduction to exponential random graph (p*) models for social networks. Social Networks (2007a).
Robins et al. Recent developments in exponential random graph (p*) models for social networks. Social Networks (2007b).
Statistical mechanics of complex networks. Reviews of Modern Physics (2002).
Barndorff-Nielsen. Information and Exponential Families in Statistical Theory (1978).
Brown, L.D., 1986. Fundamentals of statistical exponential families. IMS Lecture Notes in Monograph Series...
Defining the curvature of a statistical problem (with applications to second order efficiency) (with discussion). Annals of Statistics (1975).
The geometry of exponential families. Annals of Statistics (1978).
☆ This research is supported by Grant DA012831 from NIDA and Grant HD041877 from NICHD.