There are three essential components to a generalized linear model:
A random component: \(Y_1, Y_2, \ldots, Y_n\) are *independently distributed* with individual means \(\mu_i = E(Y_i)\);
A systematic component: \(p\) explanatory variates, collected for each observation in a \(p\)-vector \({\bf x}_i\), provide a linear predictor \(\eta_i\) given by \[\eta_i = {\bf x}_i^T {\boldsymbol \beta}\] for \(p\) unknown parameters \({\boldsymbol \beta} = (\beta_1, \ldots, \beta_p)^T\);
A link function: \(\eta_i = g(\mu_i)\).
Note that in the case of the Normal linear model we had \(Y_i \sim N(\mu_i, \sigma^2)\) and the identity link \[\eta_i = g(\mu_i) = \mu_i,\] whereas for the logistic model we had \(Y_i \sim Bernoulli(\mu_i)\) and the logit link \[\eta_i = g(\mu_i) = \log \left( \frac{\mu_i} {1-\mu_i} \right) = \theta_i.\]
In what follows, the random component will allow the \(Y_i\) to be generated from any member of a large family of distributions (the exponential family), and the link function will be allowed to be any monotonic differentiable function.
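To make these components concrete, here is a minimal computational sketch (in Python, assuming the `numpy` and `statsmodels` packages are available; the data and all variable names are illustrative only). The `family` argument plays the role of the random component, the model matrix and coefficients form the systematic component, and each family's default link here is the canonical one (identity for the Normal, logit for the Bernoulli).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
X = sm.add_constant(x)                      # model matrix for the systematic component

# Normal response with identity (canonical) link
y_norm = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)
fit_norm = sm.GLM(y_norm, X, family=sm.families.Gaussian()).fit()

# Bernoulli response with logit (canonical) link
mu = 1 / (1 + np.exp(-(-0.5 + 1.0 * x)))
y_bern = rng.binomial(1, mu)
fit_bern = sm.GLM(y_bern, X, family=sm.families.Binomial()).fit()

print(fit_norm.params)                      # estimated beta for the Normal model
print(fit_bern.params)                      # estimated beta for the logistic model
```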
In choosing a probability model to describe the data generation, it is often convenient to restrict consideration to some family of distributions. Members of a family should have enough basic characteristics in common that we can develop general methods that work for any member, but should be different enough from one another to provide a variety of possibilities within the same family. That is, the family should be broad enough to cover the situations that arise in application, yet narrow enough that common methods apply.
One family that covers a great many of the more common statistical models is the exponential family. A family \(\{F_{\bf Y}({\bf y};{\boldsymbol \theta}) \}\) of distributions is said to form an \(s\)-parameter exponential family if the distributions have densities (or probability mass functions) of the form \[ f_{\bf Y}({\bf y};{\boldsymbol \theta}) = \exp \left\{\sum_{i=1}^s \eta_i(\boldsymbol \theta) T_i({\bf y}) - B({\boldsymbol \theta}) + C({\bf y}) \right\}. \] Here the \(\eta_i(\cdot)\) and \(B(\cdot)\) are real-valued functions of the parameters \(\boldsymbol \theta\) and the \(T_i(\cdot)\) and \(C(\cdot)\) are real-valued statistics (i.e. functions of the data vector \(\bf y\)).
For our purposes, we will be looking at the vector \(\bf Y\) whose individual entries \(Y_i\) are independently distributed from the same family. This means that the joint density of \(Y_1, \ldots, Y_n\) (i.e. density of \(\bf Y\)) will simply be the product of the individual densities. We can focus therefore on the density/mass function of a single component \(Y_i\).
In particular, we will focus on random variates \(Y\) whose probability density/mass function has the following form: \[ f_{ Y}(y; \theta, \phi) = \exp \left\{\frac{\left( y \theta - b(\theta) \right)}{a(\phi)} + c(y,\phi) \right\} \] for specific functions \(a(\cdot)\), \(b(\cdot)\), and \(c(\cdot)\). Note that a new parameter \(\phi\) and a function \(a(\phi)\) were introduced. This will turn out to be convenient later. If \(\phi\) is known, then this satisfies the above conditions for a 1-parameter exponential family model. If \(\phi\) is not known, then it may or may not satisfy the conditions to be a 2-parameter exponential family model. For what follows, take \(\phi\) to be known unless we say otherwise.
For example, for \(Y \sim N(\mu, \sigma^2)\) \[ \begin{array}{rcl} f_{ Y}({y};{\mu, \sigma^2}) & = & \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left\{-\frac{(y - \mu)^2}{2\sigma^2} \right\}\\ &&\\ &=& \exp \left\{\frac{\left(y\mu - \mu^2/2 \right)}{\sigma^2} - \frac{1}{2}\left( \frac{y^2}{\sigma^2} + \log\left( 2 \pi \sigma^2\right) \right) \right\} \end{array} \] so that \(\theta=\mu\), \(\phi = \sigma^2\), and \[ a(\phi) = \phi, ~~~~~~ b(\theta) = \theta^2/2, ~~~~~~ c(y,\phi) = - \frac{1}{2}\left( \frac{y^2}{\phi} + \log\left( 2 \pi \phi\right) \right). \] Again, this is a 1-parameter exponential family distribution only when \(\phi = \sigma^2\) is known. (N.B. it is a member of a 2-parameter exponential family if \(\sigma\) is not known.)
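As a quick numerical check of this identification (a sketch only, using arbitrary illustrative values of \(y\), \(\mu\), and \(\sigma^2\), and assuming `numpy` and `scipy` are available), we can compare the exponential-family form with \(a\), \(b\), and \(c\) as above against the Normal density computed directly.

```python
import numpy as np
from scipy.stats import norm

y, mu, sigma2 = 1.3, 0.7, 2.0           # arbitrary illustrative values
theta, phi = mu, sigma2                 # canonical parameter and dispersion

a = phi
b = theta**2 / 2
c = -0.5 * (y**2 / phi + np.log(2 * np.pi * phi))

f_expfam = np.exp((y * theta - b) / a + c)
f_direct = norm.pdf(y, loc=mu, scale=np.sqrt(sigma2))
print(f_expfam, f_direct)               # the two values agree
```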
Similarly, for \(Y \sim Bernoulli (\mu)\) \[ \begin{array}{rcl} f_{ Y}({ y};{ \mu}) & = & \mu^y (1- \mu)^{1-y}\\ &&\\ &=& \exp \left\{y \log \left( \frac{\mu }{1-\mu} \right) + \log(1 - \mu) \right\} \end{array} \] so that \(\theta=\log \left( \frac{\mu }{1-\mu} \right)\), \(\phi = 1\), and \[ a(\phi) = \phi=1, ~~~~~~ b(\theta) = \log\left( 1 + e^\theta \right), ~~~~~~ c(y,\phi) = 0. \] This is clearly a member of the 1-parameter exponential family.
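A similar check for the Bernoulli case (again a sketch with an arbitrary \(\mu\), assuming `numpy`): here \(a(\phi) = 1\) and \(c(y, \phi) = 0\), so only \(\theta\) and \(b(\theta)\) matter.

```python
import numpy as np

mu = 0.3                                   # arbitrary illustrative success probability
theta = np.log(mu / (1 - mu))              # canonical parameter
b = np.log(1 + np.exp(theta))              # b(theta)

for y in (0, 1):
    f_expfam = np.exp(y * theta - b)       # a(phi) = 1, c(y, phi) = 0
    f_direct = mu**y * (1 - mu)**(1 - y)
    print(y, f_expfam, f_direct)           # the two values agree
```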
Note that in both of the above examples, we used \(\eta = \theta\) as our link function. Because of the position of \(\theta\) in these 1-parameter exponential family descriptions, \(\eta = \theta\) is called the canonical link for that family. In any particular application, of course, a link function other than the canonical link might be chosen.
The likelihood function for these models will be \[L(\theta, \phi; y) \propto f_{ Y}(y; \theta, \phi) \] and given the general structure of \(f_{ Y}(y; \theta, \phi)\) as above, the log likelihood (up to an arbitrary additive constant) is \[ \begin{array}{rcl} l(\theta, \phi; y) &=& \log \left(f_{ Y}(y; \theta, \phi) \right) \\ &&\\ & = & \left\{ y \theta - b(\theta) \right\}/ a(\phi) + c(y,\phi) . \end{array} \] The first two derivatives of the log-likelihood are then easily obtained as \[ \frac{\partial l}{\partial \theta} = \left\{y - b'(\theta) \right\}/a(\phi) \] and \[ \frac{\partial^2 l}{\partial \theta^2} = \left\{- b''(\theta) \right\}/a(\phi). \] Now, the first of these is simply the score function \(S(\theta; y)\) which, as a function of the random variate \(Y\), namely \(S(\theta; Y)\), has expectation zero. That is, \[ \begin{array}{rcl} E(S(\theta; Y)) & = & E \left\{ \frac{\partial l(\theta, \phi; Y)}{\partial \theta} \right\} \\ & & \\ & = & E \left\{ \frac{\partial \log( f_Y(Y; \theta, \phi))}{\partial \theta} \right\} \\ & & \\ & = & E \left\{\frac{1} {f_Y(Y; \theta, \phi)} \frac{\partial f_Y(Y; \theta, \phi)} {\partial \theta} \right\}\\ & & \\ & = & \int \left\{\frac{1}{f_Y(y; \theta, \phi)} \frac{\partial f_Y(y; \theta, \phi)}{\partial \theta} \right\}f_Y(y; \theta, \phi) dy\\ & & \\ & = &\frac{\partial }{\partial \theta} \int \left\{ f_Y(y; \theta, \phi) \right\} dy\\ & & \\ & = &\frac{\partial }{\partial \theta} \left\{ 1 \right\} \\ & & \\ & = & 0. \\ \end{array} \] This means we can write \[0 = E(S(\theta; Y)) = \left\{E(Y) - b'(\theta) \right\}/a(\phi)\] or \[ 0 = \left\{\mu - b'(\theta) \right\}/a(\phi)\] where \(\mu = E(Y)\). Solving this for \(\mu\), we can express \(E(Y)\) for any member of this family as \[ E(Y) = \mu = b'(\theta) .\]
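A small Monte Carlo sketch (assuming `numpy`; the values are arbitrary) illustrating both facts for the Bernoulli case, where \(b'(\theta) = e^\theta/(1+e^\theta)\): the score has mean approximately zero, and \(b'(\theta)\) recovers \(E(Y) = \mu\).

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 0.3
theta = np.log(mu / (1 - mu))                    # canonical parameter for Bernoulli(mu)
Y = rng.binomial(1, mu, size=200_000)            # many independent draws

b_prime = np.exp(theta) / (1 + np.exp(theta))    # b'(theta)
scores = (Y - b_prime) / 1.0                     # S(theta; Y), with a(phi) = 1

print(b_prime, Y.mean())                         # b'(theta) = mu, close to the sample mean
print(scores.mean())                             # approximately zero
```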
Similarly, we can determine \(Var(Y)\). To do so, we first derive another general fact about likelihoods. From above we have that \[ 0 = \int \left\{ \frac{\partial \log( f_Y(y; \theta, \phi) )}{\partial \theta} \right\}f_Y(y; \theta, \phi) dy. \] Differentiating both sides again with respect to \(\theta\) we have
\[ \begin{array}{rcl} 0 &=& \frac{\partial} {\partial \theta} \int \left\{ \frac{\partial \log( f_Y(y; \theta, \phi) )} {\partial \theta} \right\} f_Y(y; \theta, \phi) ~ dy \\ && \\ && \\ &=& \int\frac{\partial} {\partial \theta} \left( \left\{ \frac{\partial \log( f_Y(y; \theta, \phi) )} {\partial \theta} \right\} f_Y(y; \theta, \phi) \right) ~ dy \\ && \\ && \\ &=& \int \left( \left\{ \frac{\partial^2 \log( f_Y(y; \theta, \phi) )} {\partial \theta^2} \right\} f_Y(y; \theta, \phi) + \left\{ \frac{\partial \log( f_Y(y; \theta, \phi) )} {\partial \theta} \frac{\partial f_Y(y; \theta, \phi) } {\partial \theta} \right\} \right) ~ dy \\ && \\ && \\ &=& \int \left( \frac{\partial^2 \log( f_Y(y; \theta, \phi) )} {\partial \theta^2} \right) f_Y(y; \theta, \phi) ~ dy\\ && ~~~~~~ + \int \left( \frac{\partial \log( f_Y(y; \theta, \phi) )} {\partial \theta} \frac{1}{ f_Y(y; \theta, \phi)} \frac{\partial f_Y(y; \theta, \phi) } {\partial \theta} \right) f_Y(y; \theta, \phi) ~ dy \\ && \\ && \\ &=& \int \left( \frac{\partial^2 \log( f_Y(y; \theta, \phi) )} {\partial \theta^2} \right) f_Y(y; \theta, \phi) ~ dy + \int \left( \frac{\partial \log( f_Y(y; \theta, \phi) )} {\partial \theta} \right)^2 f_Y(y; \theta, \phi) ~ dy \\ && \\ && \\ &=& \int \left( \frac{\partial^2 l(\theta, \phi; y) } {\partial \theta^2} \right) f_Y(y; \theta, \phi) ~ dy + \int \left( \frac{\partial l(\theta, \phi; y)} {\partial \theta} \right)^2 f_Y(y; \theta, \phi) ~ dy \\ && \\ && \\ &=& E \left( \frac{\partial^2 l(\theta, \phi; Y) } {\partial \theta^2} \right) + E \left( \left[ \frac{\partial l(\theta, \phi; Y)} {\partial \theta} \right]^2 \right) \\ \end{array} \] Or, written more succinctly, we have that \[E\left(\frac{\partial^2 l}{\partial \theta^2}\right) + E\left(\left[\frac{\partial l}{\partial \theta}\right]^2\right) = 0. \] Substituting the values obtained above for the partial derivatives, we get \[ \begin{array}{rcl} 0 &=& E\left(- b''(\theta)/a(\phi)\right) + E\left(\left[\left\{Y - b'(\theta) \right\}/a(\phi)\right]^2\right) \\ && \\ & = & - \frac{b''(\theta)}{a(\phi)} + \frac{Var(Y)}{a(\phi)^2}. \end{array} \] Rearranging, we find that \[ Var(Y) = b''(\theta)a(\phi). \] Note that the variance is the product of two pieces.
The first, \(b''(\theta)\), depends only on \(\theta\), the canonical parameter, and hence on the mean.
It is called the variance function, \(V(\mu)\).
The second piece is \(a(\phi)\) and commonly is of the form \[a(\phi) = \frac{\phi}{w}\] where \(w\) is some known constant. In this case \(\phi\) is called a dispersion parameter.
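As a quick check of this variance identity (a sketch with arbitrary values, assuming `numpy`): for the Normal, \(b''(\theta)\,a(\phi) = 1\cdot\sigma^2\), and for the Bernoulli, \(b''(\theta)\,a(\phi) = \mu(1-\mu)\cdot 1\), the familiar variances.

```python
import numpy as np

# Normal(mu, sigma2): b(theta) = theta^2/2, so b''(theta) = 1, and a(phi) = sigma2
mu, sigma2 = 0.7, 2.0
print(1.0 * sigma2, sigma2)                    # equals sigma^2, as expected

# Bernoulli(mu): b(theta) = log(1 + e^theta), so b''(theta) = mu(1 - mu), and a(phi) = 1
mu = 0.3
theta = np.log(mu / (1 - mu))
b_dbl_prime = np.exp(theta) / (1 + np.exp(theta))**2
print(b_dbl_prime * 1.0, mu * (1 - mu))        # both equal mu(1 - mu)
```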
All of the above was constructed from the probability density/mass function for a single observation \(y\) and its random variate \(Y\).
Suppose we have \(Y_1, \ldots, Y_n\) as independent random variables from the same family (though with possibly different means), that is \[ f_{Y_i}(y_i ; \theta_i, \phi) = \exp \left\{\frac{\left( y_i \theta_i - b(\theta_i) \right)}{a_i(\phi)} + c(y_i,\phi) \right\}. \] The likelihood will be formed from the product of these and the log-likelihood will just be a sum that is considerably simplified by the exponential family we have chosen: \[ \begin{array}{rcl} l( {\boldsymbol \theta}, \phi; {\bf y}) & = & \sum_{i=1}^n l_i( {\theta_i}, \phi; {y_i}) ~~~~ \mbox{ the sum of the individual log-likelihood contributions, say }\\ && \\ & = & \sum_{i=1}^n \left\{ \frac{\left( y_i \theta_i - b(\theta_i) \right)} {a_i(\phi)} + c(y_i,\phi) \right\} \\ && \\ & =& \sum_{i=1}^n \frac{ y_i \theta_i}{a_i(\phi)} - \sum_{i=1}^n\frac{b(\theta_i)}{a_i(\phi)} + \sum_{i=1}^nc(y_i,\phi) . \\ \end{array} \]
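The following sketch (Normal case with known \(\sigma^2\), simulated values, assuming `numpy` and `scipy`) confirms numerically that this exponential-family form of the summed log-likelihood agrees with the sum of the log densities computed directly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu = np.array([0.0, 1.0, -0.5, 2.0, 0.3])       # possibly different means
sigma2 = 1.5                                    # common, known dispersion
y = rng.normal(mu, np.sqrt(sigma2))

theta = mu                                      # canonical parameter for the Normal
a = sigma2                                      # a_i(phi) = phi = sigma^2 here
b = theta**2 / 2
c = -0.5 * (y**2 / sigma2 + np.log(2 * np.pi * sigma2))

loglik_expfam = np.sum((y * theta - b) / a + c)
loglik_direct = np.sum(norm.logpdf(y, loc=mu, scale=np.sqrt(sigma2)))
print(loglik_expfam, loglik_direct)             # the two sums agree
```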
If there is no further restriction on \(\theta_i\) for all \(i=1,\ldots, n\), then the above is the log-likelihood for the saturated model. Maximizing this with respect to \(\theta_i\), we find the maximum likelihood estimate for \(\theta_i\) as the solution to the equation \[ y_i = b'(\widehat{\theta}_i)\] which for the Normal yields \(\widehat{\theta}_i = y_i\) and hence \(\widehat{\mu}_i = y_i\), and which for the Bernoulli distribution gives \(y_i = e^{\widehat{\theta}_i}/(1 + e^{\widehat{\theta}_i}) = \widehat{\mu}_i\), a perfect fit in each case.
This perfect fit for the saturated model will always be the case since we know \(\mu_i = E(Y_i) = b'(\theta_i)\). Hence \(\widehat{\mu}_i = y_i\) for the saturated model.
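A tiny illustration of solving \(y_i = b'(\widehat{\theta}_i)\) in the Normal case (a sketch assuming `numpy`, with a single arbitrary observation and a crude grid search): since \(b'(\theta) = \theta\), the maximizer is \(\widehat{\theta}_i = y_i\) and hence \(\widehat{\mu}_i = y_i\).

```python
import numpy as np

# One Normal observation with sigma^2 known: l(theta) = (y*theta - theta^2/2)/sigma^2 + const.
y, sigma2 = 1.3, 2.0
theta_grid = np.linspace(-5, 5, 100001)
loglik = (y * theta_grid - theta_grid**2 / 2) / sigma2
theta_hat = theta_grid[np.argmax(loglik)]

print(theta_hat, y)          # the maximizer is (to grid accuracy) theta_hat = y, so mu_hat = y
```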
Let’s denote the maximum likelihood estimate of \(\boldsymbol \theta\) under the saturated model as \(\widehat{\boldsymbol \theta}^*\). Then we can write the log-likelihood function at this value as \[l(\widehat{\boldsymbol \theta}^*, \phi ; {\bf y}).\]
Were we to restrict the value of \({\boldsymbol \theta}\) to lie in some parameter space \(\Omega\) of dimension \(p < n\) and to find the maximum likelihood estimate under this constraint to be \(\widehat{\boldsymbol \theta}\), then the deviance would be calculated as \[ d = 2 \times \left\{l(\widehat{\boldsymbol \theta}^*, \phi ; {\bf y}) - l(\widehat{\boldsymbol \theta}, \phi ; {\bf y}) \right\} \ge 0. \] The random variate \(D\), corresponding to replacing \(y_i\) everywhere by \(Y_i\), would (under the hypothesis \(H:{\boldsymbol \theta} \in \Omega\)) have an approximately \(\chi_{n-p}^2\) distribution.
As with the two log-likelihoods of which it is composed, for independent observations the deviance statistic can be written as \[ \begin{array}{rcl} d & = & 2 \times \left\{ l(\widehat{\boldsymbol \theta}^*, \phi ; {\bf y}) - l(\widehat{\boldsymbol \theta}, \phi ; {\bf y}) \right\} \\ && \\ &=& 2 \times \sum_{i=1}^n \left\{ l_i(\widehat{ \theta}_i^*, \phi ; y_i) - l_i(\widehat{\theta}_i, \phi ; y_i) \right\} \\ && \\ &=&\sum_{i=1}^n 2 \times \left\{ l_i(\widehat{ \theta}_i^*, \phi ; y_i) - l_i(\widehat{\theta}_i, \phi ; y_i) \right\}\\ && \\ &=&\sum_{i=1}^n d_i ~~~~ \mbox{ say.}\\ \end{array} \] Each such \(d_i\) represents, roughly speaking, the contribution of the observation \(y_i\) to the overall deviance.
Analogous to the estimated residuals in the Normal linear model, we calculate \[r_i^D = \mbox{sign}(y_i - \widehat{\mu}_i) \sqrt{~ d_i~}\] and call these the deviance residuals for observations \(i=1, \ldots, n\). Their random counterparts, which we can call \(D_i\), will be (very) approximately \(N(0,1)\) when the hypothesis \(H:{\boldsymbol \theta} \in \Omega\) is true.
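The sketch below (simulated Bernoulli data and a logistic fit via `statsmodels`; `numpy` and `statsmodels` are assumed to be available) computes the \(d_i\) and the deviance residuals \(r_i^D\) by hand and compares them with the fitted model's reported quantities. For binary \(y_i\) the saturated contribution \(y_i \log y_i + (1-y_i)\log(1-y_i)\) is zero (with the convention \(0\log 0 = 0\)), so \(d_i = -2\, l_i(\widehat{\theta}_i, \phi; y_i)\).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
X = sm.add_constant(x)
mu = 1 / (1 + np.exp(-(-0.5 + 1.0 * x)))
y = rng.binomial(1, mu)

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
mu_hat = fit.fittedvalues                    # fitted means mu_hat_i

# For binary y_i the saturated contribution is zero, so d_i = -2 * l_i(theta_hat_i, phi; y_i)
d_i = -2 * (y * np.log(mu_hat) + (1 - y) * np.log(1 - mu_hat))
r_D = np.sign(y - mu_hat) * np.sqrt(d_i)

print(d_i.sum(), fit.deviance)               # compare the total with the reported deviance
print(np.allclose(r_D, fit.resid_deviance))  # compare with the reported deviance residuals
```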
If we choose the canonical link function, that is if \(\theta_i = {\bf x}_i^T{\boldsymbol \beta}\) with \({\bf x}_i = (x_{i1}, x_{i2}, \ldots , x_{ip})^T\) and \({\boldsymbol \beta} = (\beta_1, \ldots, \beta_p)^T\), then the log-likelihood becomes
\[ \begin{array}{rcl} l( {\boldsymbol \beta}, \phi; {\bf y}) & =& \sum_{i=1}^n \frac{ y_i ({\bf x}_i^T{\boldsymbol \beta})}{a_i(\phi)} - \sum_{i=1}^n\frac{b({\bf x}_i^T{\boldsymbol \beta})}{a_i(\phi)} + \sum_{i=1}^nc(y_i,\phi) \\ && \\ && \\ & =& \sum_{i=1}^n \frac{ y_i \sum _{j=1}^p \beta_j x_{ij}}{a_i(\phi)} - \sum_{i=1}^n\frac{b({\bf x}_i^T{\boldsymbol \beta})}{a_i(\phi)} + \sum_{i=1}^nc(y_i,\phi) \\ && \\ && \\ & =& \sum _{j=1}^p \beta_j \sum_{i=1}^n \frac{ y_i x_{ij}}{a_i(\phi)} - \sum_{i=1}^n\frac{b({\bf x}_i^T{\boldsymbol \beta})}{a_i(\phi)} + \sum_{i=1}^nc(y_i,\phi) .\\ \end{array} \] When \(\phi\) is known, then, the function of the data (the \(y_i\)s) that determines the log-likelihood for parameter \(\beta_j\) (up to an additive constant not involving the parameters) is \[\sum_{i=1}^n \frac{ y_i x_{ij}}{a_i(\phi)}.\] This is the statistic through which the data enter the log-likelihood for \(\beta_j\), and hence it is a sufficient statistic for estimating \(\beta_j\). If we also (as we often will) have \(a_i(\phi) = a(\phi)\), i.e. it is the same function for all \(i\), then the statistic \[\sum_{i=1}^n y_i x_{ij}\] will be sufficient for estimating \(\beta_j\), and the vector \({\bf X}^T{\bf y}\) will be sufficient for estimating the parameter vector \({\boldsymbol \beta}\).
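To see this sufficiency in action, here is a minimal sketch (assuming only `numpy`, with simulated data) of Newton's method for the Bernoulli GLM with the canonical (logit) link. Differentiating the log-likelihood above with respect to \(\beta_j\) and using \(\mu_i = b'({\bf x}_i^T{\boldsymbol \beta})\) gives the score vector \({\bf X}^T({\bf y} - {\boldsymbol \mu})/a(\phi)\), so the data enter the estimating equations only through \({\bf X}^T{\bf y}\).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

Xty = X.T @ y                              # the sufficient statistic X^T y

beta = np.zeros(p)
for _ in range(25):                        # Newton-Raphson iterations (a(phi) = 1 cancels)
    mu = 1 / (1 + np.exp(-X @ beta))       # mu_i = b'(theta_i), with theta_i = x_i^T beta
    score = Xty - X.T @ mu                 # data enter only through X^T y
    info = X.T @ ((mu * (1 - mu))[:, None] * X)   # uses b''(theta_i) = mu_i (1 - mu_i)
    beta = beta + np.linalg.solve(info, score)

print(beta)                                               # maximum likelihood estimate
print(Xty - X.T @ (1 / (1 + np.exp(-X @ beta))))          # score ~ 0 at the MLE
```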