Appendix B: Statistics

The material in this appendix can be supplemented with the classic text in statistics: E. Lehmann, Theory of Point Estimation, Wiley, New York (1983).


Sufficiency, Completeness and Unbiased Estimation

In statistics, we often represent our data, in many cases a sample of size $n$ from some population, as a random vector $X=(X_{1},...,X_{n})$. The model can be written in the form $\{f_{\theta}(x);\ \theta\in\Omega\}$ where $\Omega$ is the parameter space, or set of permissible values of the parameter, and $f_{\theta}(x)$ is the probability density function. A statistic, $T(X),$ is a function of the data which does not depend on the unknown parameter $\theta$. Although a statistic $T(X)$ is not a function of $\theta$, its distribution can depend on $\theta.$ An estimator is a statistic considered for the purpose of estimating a given parameter. One of our objectives is to find a ``good'' estimator of the parameter $\theta$, in some sense of the word ``good''. How do we ensure that a statistic $T(X)$ is estimating the correct parameter, is not consistently too large or too small, and has as little variability as possible? The problem of estimating the correct parameter is often dealt with by requiring that the estimator be unbiased.

We will denote an expected value under the assumed parameter value $\theta$ by $E_{\theta}(\cdot)$. Thus, in the continuous case $E_{\theta}[T(X)]=\int T(x)f_{\theta}(x)dx$ and in the discrete case $E_{\theta}[T(X)]=\sum_{x}T(x)f_{\theta}(x)$, provided the integral/sum converges absolutely. In the discrete case, $f_{\theta}(x)=P_{\theta}[X=x]$ is the probability function of $X$ under this parameter value $\theta.$

Definition

A statistic $T(X)$ is an unbiased estimator of $\theta$ if $E_{\theta}[T(X)]=\theta$ for all $\theta\in\Omega$.

For example, suppose that the $X_{i}$ are independent, each with the Poisson distribution with parameter $\theta$, $i=1,...,n$. Notice that the statistic $T(X)=\bar{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$ is such that $E_{\theta}(T)=\frac{1}{n}\sum_{i=1}^{n}E_{\theta}(X_{i})=\theta$ and so $T$ is an unbiased estimator of $\theta.$ This means that it is centered in the correct place, but does not mean it is a best estimator in any sense.
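As a quick numerical check (an added sketch using numpy, not part of the original text; the value $\theta=2.5$ is an arbitrary choice), simulation confirms that the sample mean of Poisson observations is centered at $\theta$:

```python
import numpy as np

# Check by simulation that T = X-bar is unbiased for the Poisson parameter.
rng = np.random.default_rng(1)
theta, n, reps = 2.5, 20, 100_000

samples = rng.poisson(theta, size=(reps, n))
T = samples.mean(axis=1)          # the statistic computed on each sample

print(T.mean())   # close to theta = 2.5, illustrating E_theta(T) = theta
print(T.var())    # close to theta/n = 0.125, the variance of the sample mean
```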

In Decision Theory, in order to determine whether a given estimator or statistic $T(X)$ does well for estimating $\theta$, we consider a loss function, or distance function, between the estimator and the true value; call this $L(T(X),\theta)$. This is averaged over all possible values of the data to obtain the risk: $R(\theta)=E_{\theta}[L(T(X),\theta)].$ A good estimator is one with little risk; a bad estimator is one whose risk is high. One particular risk function is called mean squared error (M.S.E.) and corresponds to the loss function $L(T,\theta)=(T-\theta)^{2}$. The mean squared error has a useful decomposition into two components, the variance of the estimator and the square of its bias: $E_{\theta}[(T-\theta)^{2}]=\mathrm{var}_{\theta}(T)+[E_{\theta}(T)-\theta]^{2}.$

For example, if $X$ has a Normal$(\theta,1)$ distribution, the mean squared error of $T_{1}=X$ is 1 for all $\theta$ because the bias $E_{\theta}(T_{1})-\theta$ is zero. On the other hand the estimator $T_{2}=X/2$ has bias $E_{\theta}(T_{2})-\theta=-\theta/2$ and variance $\frac{1}{4}$ so the mean squared error is $\frac{1}{4}+\frac{\theta^{2}}{4}.$ Obviously $T_{2}$ has smaller mean squared error provided that $\theta$ is around 0 (more precisely provided $\theta^{2}<3)$, but for $\theta$ large, $T_{1}$ is preferable. Of these two estimators, only $T_{1}$ is unbiased.
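The decomposition and the comparison above can be verified numerically. The sketch below (an added illustration for the Normal$(\theta,1)$ model; the grid of $\theta$ values is arbitrary) estimates both mean squared errors by simulation and exhibits the crossover at $\theta^{2}=3$:

```python
import numpy as np

# MSE(T) = var(T) + bias(T)^2.  For X ~ N(theta, 1):
#   T1 = X   has MSE = 1;   T2 = X/2 has MSE = 1/4 + theta^2/4.
rng = np.random.default_rng(2)
reps = 200_000
for theta in [0.0, 1.0, np.sqrt(3), 3.0]:
    X = rng.normal(theta, 1.0, size=reps)
    mse1 = np.mean((X - theta) ** 2)
    mse2 = np.mean((X / 2 - theta) ** 2)
    print(f"theta={theta:5.2f}  MSE(T1)={mse1:.3f}  MSE(T2)={mse2:.3f}"
          f"  theory for T2: {0.25 + theta**2 / 4:.3f}")
# At theta = sqrt(3) the two mean squared errors coincide.
```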

In fact, if we use mean squared error as our basis for comparison, there is usually no single estimator which outperforms all other estimators at all values of the parameter. In order to achieve an optimal estimator, it is unfortunately necessary to restrict ourselves to a specific class of estimators and select the best within the class. Of course, the best within this class will only be as good as the class itself (best in a class of one is not much of a recommendation), and therefore we must ensure that restricting ourselves to this class is not unduly restrictive. The class of all estimators is usually too large to obtain a meaningful solution. One common restriction is to the class of all unbiased estimators.

Definition

An estimator $T(X)$ is said to be a uniformly minimum variance unbiased estimator (U.M.V.U.E.) of the parameter $\theta$ if

(i) it is an unbiased estimator of $\theta$ and

(ii) among all unbiased estimators of $\theta$ it has the smallest mean squared error and therefore the smallest variance.

A sufficient statistic is one that, from a certain perspective, contains all the necessary information for making inferences (e.g. estimating the parameter with a point estimator or confidence interval, conducting a test of a hypothesized value) about the unknown parameters in a given model. It is important to remember that a statistic is sufficient for inference on a specific parameter; it does not necessarily contain all relevant information in the data for other inferences. For example, if you wished to test whether the family of distributions is an adequate fit to the data (a goodness-of-fit test), the sufficient statistic for the parameter in the model does not contain the relevant information.

Suppose the data is in a vector $X$ and $T=T(X)$ is a sufficient statistic for $\theta$. The intuitive basis for sufficiency is that if the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$, then $X$ provides no additional information beyond $T$ for estimating $\theta$. The assumption is that random variables carry information on a statistical parameter $\theta$ only insofar as their distributions (or conditional distributions) change with the value of the parameter, and that since, given $T(X)$, we can generate values for $X$ at random without knowledge of the parameter and with the correct distribution, these randomly generated values cannot carry additional information. All of this, of course, assumes that the model is correct and $\theta$ is the only unknown. The distribution of $X$ given a sufficient statistic $T$ will often have value for other purposes, such as measuring the variability of the estimator or testing the validity of the model.

Definition

A statistic $T(X)$ is sufficient for a statistical model $\{f_{\theta}(x);\ \theta\in\Omega\}$ if the distribution of the data $X=(X_{1},...,X_{n})$ given $T(X)=t$ does not depend on the unknown parameter $\theta$.

The use of a sufficient statistic is formalized in the Sufficiency Principle, which states that if $T(X)$ is a sufficient statistic for a model $\{f_{\theta}(x);$ $\theta\in\Omega\}$ and $x_{1},$ $x_{2}$ are two different possible observations that have identical values of the sufficient statistic, $T(x_{1})=T(x_{2}),$ then whatever inference we would draw from observing $x_{1}$ we should draw exactly the same inference from $x_{2}$.

Sufficient statistics are not unique. For example, if the sample mean $\bar{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$ is a sufficient statistic, then any other statistic that allows us to obtain $\bar{X}$ is also sufficient. This will include all one-to-one functions of $\bar{X}$ (these are essentially equivalent) like $\bar{X}^{3}$, and all statistics $T(X)$ for which we can write $\bar{X}=g(T)$ for some, possibly many-to-one, function $g$. One result which is normally used to verify whether a given statistic is sufficient is the Factorization Criterion for Sufficiency: Suppose $X$ has probability density function $f_{\theta}(x)$ and $T(X)$ is a statistic. Then $T(X)$ is a sufficient statistic for $\theta$ if and only if there exist two non-negative functions $g(\cdot)$ and $h(\cdot)$ so that we can factor the probability density function $f_{\theta}(x)=g(T(x);\theta)h(x)$ for all $x.$ This factorization into two pieces, one which involves both the statistic $T$ and the unknown parameter $\theta,$ and the other which may be a constant or depend on $x$ but does not depend on the unknown parameter, need only hold on a set $A$ of possible values of $X$ which carries the full probability. That is, for some set $A$ with $P_{\theta}[X\in A]=1$ for all $\theta\in\Omega$, we require $f_{\theta}(x)=g(T(x);\theta)h(x)$ for all $x\in A.$
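The defining property of sufficiency can also be seen by simulation. For a Poisson sample the sum $\sum_{i}X_{i}$ is sufficient, and given the sum the data are multinomial with equal cell probabilities whatever the value of $\theta$; the sketch below (an added illustration with arbitrary choices of $n$, $t$ and the two $\theta$ values) conditions simulated samples on a fixed value of the sum and shows that the conditional distribution of $X_{1}$ does not change with $\theta$:

```python
import numpy as np

# For X_1,...,X_n iid Poisson(theta), T = sum(X_i) is sufficient: given
# T = t, (X_1,...,X_n) is multinomial(t; 1/n,...,1/n), free of theta,
# so X_1 | T = t is Binomial(t, 1/n).
rng = np.random.default_rng(3)
n, t = 5, 10

def conditional_dist_of_x1(theta, reps=400_000):
    X = rng.poisson(theta, size=(reps, n))
    kept = X[X.sum(axis=1) == t]        # condition on the sufficient statistic
    return np.bincount(kept[:, 0], minlength=t + 1) / len(kept)

# Two quite different theta values give (up to simulation noise) the same law:
print(conditional_dist_of_x1(1.0).round(3))
print(conditional_dist_of_x1(3.0).round(3))
```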

Definition

A statistic $T(X)$ is a minimal sufficient statistic for $\theta$ if it is sufficient and if for any other sufficient statistic $U(X)$, there exists a function $g(\cdot)$ such that $T(X)=g(U(X))$.

This definition says in effect that a minimal sufficient statistic can be recovered from any other sufficient statistic. A statistic $T(X)$ implicitly partitions the sample space into events of the form $[T(X)=t]$ for varying $t,$ and if $T(X)$ is minimal sufficient, it induces the coarsest possible partition (i.e. the largest possible sets) in the sample space among all sufficient statistics. This partition is called the minimal sufficient partition.

The property of completeness is one which is useful for determining the uniqueness of estimators and verifying in some cases that a minimal sufficient reduction has been found. It bears no relation to the notion of a complete market in Finance, or the mathematical notion of a complete metric space. Let $X$ denote the observations from a distribution with probability density function $f_{\theta}(x)$. Suppose $T(X)$ is a statistic and $u(T)$, a function of $T$, is an unbiased estimator of $\theta$ so that $E_{\theta}[u(T)]=\theta$ for all $\theta\in\Omega$. Under what circumstances is this the only unbiased estimator which is a function of $T$? To answer this question, suppose $u_{1}(T)$ and $u_{2}(T)$ are both unbiased estimators of $\theta$ and consider the difference $h(T)=u_{1}(T)-u_{2}(T)$. Since $u_{1}(T)$ and $u_{2}(T)$ are both unbiased estimators of the parameter $\theta,$ we have $E_{\theta}[h(T)]=0$ for all $\theta\in\Omega$. Now if the only function $h(T)$ which satisfies $E_{\theta}[h(T)]=0$ for all $\theta\in\Omega$ is the zero function $h(t)=0$, then the two unbiased estimators must be identical. A statistic $T$ with this property is said to be complete. Technically it is not the statistic that is complete, but the family of distributions of $T$ in the model $\{f_{\theta}(x);\ \theta\in\Omega\}.$

Definition

The statistic $T(X)$ is complete if, for any function $h$, $E_{\theta}[h(T)]=0$ for all $\theta\in\Omega$ implies $P_{\theta}[h(T)=0]=1$ for all $\theta\in\Omega.$

For example, let $X_{1},...,X_{n}$ be a random sample from the Normal$(\theta,1)$ distribution. Consider $T=(S^{2},\bar{X})$ where $\bar{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$ is the sample mean and $S^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}$ is the sample variance. Then $T$ is sufficient for $\theta$ but is not complete. It is easy to see that it is not complete, because the function $h(T)=S^{2}-1$ is a function of $T$ which has zero expectation for all values of $\theta$ (the variance of the distribution is known to be one), and yet the function is not identically zero. The fact that the statistic $T$ is sufficient but not complete is a hint that further reduction is possible, that it is not minimal sufficient. In fact in this case, as we will show a little later, taking only the second component of $T,$ namely $\bar{X},$ provides a minimal sufficient, complete statistic.

Theorem B1

If $T(X)$ is a complete and sufficient statistic for the model $\{f_{\theta}(x);\ \theta\in\Omega\}$, then $T(X)$ is a minimal sufficient statistic for the model.

The converse to the above theorem is not true. Let $X=(X_{1},...,X_{n})$ be a random sample from the continuous uniform distribution on the interval $[\theta,\theta+1]$. This distribution has probability density function $f_{\theta}(x)=1$ for $\theta\leq x\leq\theta+1.$ Then using the factorization criterion above, the joint probability density function for a sample of $n$ independent observations from this density is $$f_{\theta}(x_{1},...,x_{n})=I(\theta\leq x_{(1)})I(x_{(n)}\leq\theta+1)$$ where the indicator function $I(\cdot)$ is one or zero as the inequality holds or does not hold and $x_{(1)},x_{(n)}$ are the smallest and the largest values in the sample $(x_{1},...,x_{n}).$ Obviously $f_{\theta}(x_{1},...,x_{n})$ can be written as a function $g(T(x);\theta)$ where $T(x)=(x_{(1)},x_{(n)})$ and so $T(X)$ is sufficient. Moreover it is not difficult to show that no further reduction (for example to $X_{(1)}$ alone) is possible, or we can no longer provide such a factorization, so $T(X)$ is minimal sufficient. Nevertheless, if $T=(X_{(1)},X_{(n)})$ and the function $h$ is defined by $h(T)=X_{(n)}-X_{(1)}-\frac{n-1}{n+1}$ (clearly a non-zero function) then $E_{\theta}[h(T)]=0$ for all $\theta \in\Omega$ and therefore $T(X)$ is not a complete statistic.
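A simulation makes the failure of completeness concrete; the sketch below (an added check, using the reconstruction of $h$ above) verifies that $h(T)=X_{(n)}-X_{(1)}-\frac{n-1}{n+1}$ has mean zero for several values of $\theta$ even though $h$ is not the zero function:

```python
import numpy as np

# For a sample of size n from Uniform(theta, theta+1), the expected range is
# (n-1)/(n+1) for every theta, so h(T) = range - (n-1)/(n+1) has zero
# expectation without being identically zero: T = (X_(1), X_(n)) is not complete.
rng = np.random.default_rng(4)
n, reps = 8, 500_000

for theta in [-2.0, 0.0, 5.0]:
    X = rng.uniform(theta, theta + 1.0, size=(reps, n))
    h = X.max(axis=1) - X.min(axis=1) - (n - 1) / (n + 1)
    print(f"theta={theta:5.1f}   mean of h(T) = {h.mean():+.4f}")   # all near 0
```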

Theorem B2

For any random variables $X$ and $Y$, $E(Y)=E[E(Y|X)]$ and $\mathrm{var}(Y)=E[\mathrm{var}(Y|X)]+\mathrm{var}[E(Y|X)].$

In much of what follows, we wish to be able to estimate a general function of the unknown parameter, say $\tau(\theta)$, instead of the parameter $\theta$ itself. We have already seen that if $T(X)$ is a complete statistic, then there is at most one function of $T(X)$ that provides an unbiased estimator of a given $\tau(\theta).$ In fact if we can find such a function, $g(T(X)),$ then it automatically has minimum variance among all possible unbiased estimators of $\tau(\theta)$ that are based on the same data.

Theorem B3

If $T(X)$ is a complete sufficient statistic for the model $\{f_{\theta}(x);\ \theta\in\Omega\}$ and $E_{\theta}[g(T(X))]=\tau(\theta)$ for all $\theta\in\Omega$, then $g(T(X))$ is the U.M.V.U.E. of $\tau(\theta)$.

When we have a complete sufficient statistic, and we are able to find an unbiased estimator, even a bad one, of $\tau(\theta),$ then there is a simple recipe for determining the U.M.V.U.E. of $\tau(\theta).$

Theorem B4

If $T(X)$ is a complete sufficient statistic for the model $\{f_{\theta}(x);\ \theta\in\Omega\}$ and $U(X)$ is any unbiased estimator of $\tau(\theta)$, then $E(U|T)$ is the U.M.V.U.E. of $\tau(\theta)$.




Note that we did not subscript the conditional expectation $E(U|T)$ with $\theta$ because whenever $T$ is a sufficient statistic, the conditional distribution of $U(X)$ given $T$ does not depend on the underlying value of the parameter $\theta.$

Definition

Suppose $X$ has a (joint) probability density function of the form $$f_{\theta}(x)=C(\theta)\exp\left\{\sum_{j=1}^{k}q_{j}(\theta)T_{j}(x)\right\}h(x)\qquad\text{(exfd)}$$ for functions $q_{j}(\theta),$ $T_{j}(x),$ $h(x),$ $C(\theta)$. Then we say that the density is a member of the exponential family of densities. We call $(T_{1}(X),...,T_{k}(X))$ the natural sufficient statistic.

A member of the exponential family can be re-expressed in different ways and so the natural sufficient statistic is not unique. For example, we may multiply a given $T_{j}$ by a constant and divide the corresponding $q_{j}$ by the same constant, resulting in the same probability density function $f_{\theta}(x)$. Various other conditions need to be applied as well, for example to ensure that the $T_{j}(x)$ are all essentially different functions of the data. One of the important properties of the exponential family is its closure under repeated independent sampling. In general, if $X_{i},i=1,...,n$ are independent identically distributed with an exponential family distribution then the joint distribution of $(X_{1},...,X_{n})$ is also of exponential family form.

Theorem B5

Let $X=(X_{1},...,X_{n})$ be a random sample from the distribution with probability density function given by (exfd). Then $X$ also has an exponential family form, with joint probability density function $$f_{\theta}(x_{1},...,x_{n})=C^{n}(\theta)\exp\left\{\sum_{j=1}^{k}q_{j}(\theta)\sum_{i=1}^{n}T_{j}(x_{i})\right\}\prod_{i=1}^{n}h(x_{i}).$$

In other words, $C$ is replaced by $C^{n}$ and $T_{j}(x)$ by $\sum_{i=1}^{n}T_{j}(x_{i})$. The natural sufficient statistic is $\left(\sum_{i=1}^{n}T_{1}(X_{i}),...,\sum_{i=1}^{n}T_{k}(X_{i})\right).$

It is usual to reparameterize equation (exfd) by replacing $q_{j}(\theta)$ by a new parameter $\eta_{j}$. This results in a more efficient representation, the canonical form of the exponential family density: $$f_{\eta}(x)=C(\eta)\exp\left\{\sum_{j=1}^{k}\eta_{j}T_{j}(x)\right\}h(x).$$ The natural parameter space in this form is the set of all values of $\eta$ for which the above function is integrable; that is $$\left\{\eta;\ \int\exp\left\{\sum_{j=1}^{k}\eta_{j}T_{j}(x)\right\}h(x)dx<\infty\right\}.$$ We would like this parameter space to be large enough to allow intervals for each of the components of the vector $\eta$ and so we will later need to assume that the natural parameter space contains a $k$-dimensional rectangle.

If the statistic satisfies a linear constraint, for example $\sum_{j=1}^{k}a_{j}T_{j}(x)=c$ with probability one, then the number of terms $k$ could be reduced and a more efficient representation of the probability density function is possible. Similarly, if the parameters $\eta_{j}$ satisfy a linear relationship, they are not all statistically meaningful because one of the parameters is obtainable from the others. These are all situations that we would handle by reducing the model to a more efficient and non-redundant form. So in what follows, we will generally assume such a reduction has already been made and that the exponential family representation is minimal in the sense that neither the $\eta_{j}$ nor the $T_{j}$ satisfy any linear constraints.

Definition

We will say that $X$ has a regular exponential family distribution if it is in canonical form, is of full rank in the sense that neither the $T_{j} $ nor the $\eta_{j}$ satisfy any linear constraints permitting a reduction in the value of $k$, and the natural parameter space contains a $k-$dimensional rectangle.




By Theorem B5, if $X_{i}$ has a regular exponential family distribution then $(X_{1},...,X_{n})$ also has a regular exponential family distribution.




The main advantage of identifying a distribution as a member of the regular exponential family is that it allows us to quickly identify the minimal sufficient statistic and conclude that it is complete.

Theorem B6

If $X$ has a regular exponential family distribution then $(T_{1}(X),...,T_{k}(X))$ is a complete sufficient statistic.




Example

Let $X_{1},...,X_{n}$ be independent observations all from the normal $N(\mu,\sigma^{2})$ distribution. Notice that with the parameter $\theta=(\mu,\sigma^{2})$ we can write the probability density function of each $X_{i}$ as a constant $C(\theta)$ times $$\exp\left\{\frac{\mu}{\sigma^{2}}x-\frac{1}{2\sigma^{2}}x^{2}\right\},$$ so the natural parameters are $\eta_{1}=\frac{\mu}{\sigma^{2}}$ and $\eta_{2}=-\frac{1}{2\sigma^{2}}$ and the natural sufficient statistic is $(X,X^{2}).$ For a sample of size $n$ from this density we have the same natural parameters, and, by the above theorem, a complete sufficient statistic is $(\sum_{i=1}^{n}X_{i},\sum_{i=1}^{n}X_{i}^{2}).$ If you wished to find a U.M.V.U.E. of any function of $\eta_{1},\eta_{2}$, for example the parameter $\eta_{1}=\frac{\mu}{\sigma^{2}},$ we need only find some function of the complete sufficient statistic which has the correct expected value. For example, in this case, with the sample mean $\bar{X}$ and the sample variance $S^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\bar{X})^{2},$ it is not difficult to show that $E\left[\frac{\bar{X}}{S^{2}}\right]=\frac{n-1}{n-3}\cdot\frac{\mu}{\sigma^{2}}$ and so, provided $n>3,$ $\frac{n-3}{n-1}\cdot\frac{\bar{X}}{S^{2}}$ is an unbiased estimator and a function of the complete sufficient statistic, so it is the desired U.M.V.U.E. Suppose one of the parameters, say $\sigma^{2},$ is assumed known. Then the normal distribution is still in the regular exponential family, since it has a representation $f_{\mu}(x)=C(\mu)\exp\{\frac{\mu}{\sigma^{2}}x\}h(x)$ with the function $h$ completely known. In this case, for a sample of size $n$ from this distribution, the statistic $\sum_{i=1}^{n}X_{i}$ is complete sufficient for $\mu$ and so any function of it, say $\overline{X},$ which is an unbiased estimator of $\mu$ is automatically U.M.V.U.E.
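The unbiasedness claim can be checked by simulation; the sketch below (assuming the estimator $\frac{n-3}{n-1}\bar{X}/S^{2}$ reconstructed above, with arbitrary values of $\mu$, $\sigma$ and $n$) verifies that it is centered at $\mu/\sigma^{2}$:

```python
import numpy as np

# Check E[ (n-3)/(n-1) * Xbar / S^2 ] = mu / sigma^2 for a normal sample,
# using independence of Xbar and S^2 and E[1/S^2] = (n-1)/((n-3) sigma^2).
rng = np.random.default_rng(5)
mu, sigma, n, reps = 1.5, 2.0, 10, 400_000

X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)                 # sample variance, divisor n - 1
est = (n - 3) / (n - 1) * xbar / s2

print(est.mean())        # close to mu / sigma^2
print(mu / sigma**2)     # = 0.375
```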




The Table below gives various members of the regular exponential family and the corresponding complete sufficient statistic.

Members of the Regular Exponential Family and the Corresponding Complete Sufficient Statistic:

Binomial$(m,\theta)$, $m$ known: $\sum_{i=1}^{n}X_{i}$
Poisson$(\theta)$: $\sum_{i=1}^{n}X_{i}$
Negative Binomial$(\theta)$: $\sum_{i=1}^{n}X_{i}$
Geometric$(\theta)$: $\sum_{i=1}^{n}X_{i}$
Normal$(\mu,\sigma^{2})$ if $\sigma^{2}$ known: $\sum_{i=1}^{n}X_{i}$
Normal$(\mu,\sigma^{2})$ if $\mu$ known: $\sum_{i=1}^{n}(X_{i}-\mu)^{2}$
Normal$(\mu,\sigma^{2})$: $(\sum_{i=1}^{n}X_{i},\ \sum_{i=1}^{n}X_{i}^{2})$
Gamma$(\alpha,\beta)$ (includes exponential) if $\alpha$ known: $\sum_{i=1}^{n}X_{i}$
Gamma$(\alpha,\beta)$ if $\beta$ known: $\sum_{i=1}^{n}\ln X_{i}$
Beta$(\alpha,\beta)$: $\left(\sum_{i=1}^{n}\ln X_{i},\ \sum_{i=1}^{n}\ln(1-X_{i})\right)$

Differentiating under the Integral

For a regular exponential family, it is possible to differentiate under the integral, that is, $$\frac{\partial^{m}}{\partial\eta_{j}^{m}}\int\exp\left\{\sum_{j=1}^{k}\eta_{j}T_{j}(x)\right\}h(x)dx=\int\frac{\partial^{m}}{\partial\eta_{j}^{m}}\exp\left\{\sum_{j=1}^{k}\eta_{j}T_{j}(x)\right\}h(x)dx$$ for any $m=1,2,\ldots$ and any $\eta$ in the interior of the natural parameter space.

Let $X$ denote observations from a distribution with probability density function $f_{\theta}(x)$ and let $U(X)$ be a statistic. The information on the parameter $\theta$ is provided by the sensitivity of the distribution of a statistic to changes in the parameter. For example, suppose a modest change in the parameter value leads to a large change in the expected value of the distribution, resulting in a large shift in the data. Then the parameter can be estimated fairly precisely. On the other hand, if the distribution of a statistic $U$ is completely insensitive to the parameter, then it would appear to contain little information for point estimation of this parameter. A statistic of the second kind is called an ancillary statistic.

Definition

$U(X)$ is an ancillary statistic if its distribution does not depend on the unknown parameter $\theta$.

Ancillary statistics are, in a sense, orthogonal or perpendicular to minimal sufficient statistics and are analogous to the residuals in a multiple regression, while the complete sufficient statistics are analogous to the estimators of the regression coefficients. It is well-known that the residuals are uncorrelated with the estimators of the regression coefficients (and independent in the case of normal errors). However, the ``irrelevance'' of the ancillary statistic seems to be limited to the case when it is not part of the minimal (preferably complete) sufficient statistic as the following example illustrates.

Example

Suppose a fair coin is tossed to determine a random variable $N=1$ with probability $1/2$ and $N=100$ otherwise. We then observe a Binomial random variable $X$ with parameters $(N,\theta)$. Then the minimal sufficient statistic is $(X,N)$ but $N$ is an ancillary statistic since its distribution does not depend on the unknown parameter $\theta$. Is $N$ completely irrelevant to inference about $\theta$? If you reported to your boss an estimator of $\theta$ such as $X/N$ without telling him or her the value of $N,$ how long would you expect to keep your job? Clearly any sensible inference about $\theta$ should include information about the precision of the estimator, and this inevitably requires knowing the value of $N.$ Although the distribution of $N$ does not depend on the unknown parameter $\theta$ so that $N$ is ancillary, it carries important information about precision. The following theorem allows us to use the properties of completeness and ancillarity to prove the independence of two statistics without finding their joint distribution.

Basu's Theorem B7

Consider $X$ with probability density function $f_{\theta}(x),$ $\theta\in\Omega$. Let $T(X)$ be a complete sufficient statistic. Then $T(X)$ is independent of every ancillary statistic $U(X)$.

Example

Assume $X_{t}$ represents the market price of a given asset such as a portfolio of stocks at time $t$ and $x_{0}$ is the value of the portfolio at the beginning of a given time period (assume that the analysis is conditional on $x_{0}$ so that $x_{0}$ is fixed and known). The process $X_{t}$ is assumed to be a Brownian motion and so the distribution of $X_{t}$ for any fixed time $t$ is Normal$(x_{0}+\mu t,\sigma^{2}t)$ for $0<t\leq1$. Suppose that for a period of length 1, we record both the period high $\max\{X_{t};0\leq t\leq1\}$ and the close $X_{1}$. Define random variables $M=\max\{X_{t};0\leq t\leq1\}-x_{0}$ and $Y=X_{1}-x_{0}$. Then the joint probability density function of $(M,Y)$ can be shown to be $$f(m,y)=\frac{2(2m-y)}{\sigma^{3}\sqrt{2\pi}}\exp\left\{-\frac{(2m-y)^{2}}{2\sigma^{2}}+\frac{\mu y}{\sigma^{2}}-\frac{\mu^{2}}{2\sigma^{2}}\right\},\qquad m\geq\max(0,y).$$

It is not hard to show that this is a member of the regular exponential family of distributions with both parameters assumed unknown. If one parameter is known, for example $\sigma^{2}$, it is again a regular exponential family distribution with $k=1.$ Consequently, if we record independent pairs of observations $(M_{i},Y_{i}),$ $\ i=1,\ldots,n$ on the portfolio for a total of $n$ distinct time periods (and if we assume no change in the parameters), then the statistic $\overline{Y}=\frac{1}{n}\sum_{i=1}^{n}Y_{i}$ is a complete sufficient statistic for the drift parameter $\mu$. Since it is also an unbiased estimator of $\mu,$ it is the U.M.V.U.E. of $\mu$. By Basu's theorem it will be independent of any ancillary statistic, i.e. any statistic whose distribution does not depend on the parameter $\mu.$ One such statistic is the sample variance $\frac{1}{n-1}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2},$ which is therefore independent of $\overline{Y}.$

Maximum Likelihood Estimation

Suppose we have observed $n$ independent discrete random variables all with probability density function $f_{\theta}(x)$ where the scalar parameter $\theta$ is unknown. Suppose our observations are $x_{1},\ldots,x_{n}$. Then the probability of the observed data is $$L(\theta)=\prod_{i=1}^{n}f_{\theta}(x_{i}).$$ When the observations have been substituted, this becomes a function of the parameter only, referred to as the likelihood function and denoted $L(\theta)$. Its natural logarithm is usually denoted $\ell(\theta)=\ln L(\theta)$. Now in the absence of any other information, it seems logical that we should estimate the parameter $\theta$ using a value most compatible with the data. For example we might choose the value maximizing the likelihood function $L(\theta)$ or equivalently maximizing $\ell(\theta)$. We call such a maximizer the maximum likelihood (M.L.) estimate provided it exists and satisfies any restrictions placed on the parameter. We denote it by $\hat{\theta}$. Obviously, it is a function of the data, that is, $\hat{\theta}=\hat{\theta}(x_{1},\ldots,x_{n})$. The corresponding estimator is $\hat{\theta}(X_{1},\ldots,X_{n})$. In practice we are usually satisfied with a local maximum of the likelihood function provided that it is reasonable, partly because the global maximization problem is often quite difficult, and partly because the global maximum is not always better than a local maximum near a preliminary estimator that is known to be consistent. In the case of a twice differentiable log likelihood function on an open interval, this local maximum is usually found by solving the equation $S(\theta)=0$ for a solution $\hat{\theta}$, where $$S(\theta)=\frac{\partial}{\partial\theta}\ell(\theta)$$ is called the score function. The equation $S(\theta)=0$ is called the (maximum) likelihood equation or score equation. To verify a local maximum we compute the second derivative $\frac{\partial^{2}}{\partial\theta^{2}}\ell(\theta)$ at $\hat{\theta}$ and show that it is negative, or alternatively show $I(\hat{\theta})>0$, where $$I(\theta)=-\frac{\partial^{2}}{\partial\theta^{2}}\ell(\theta)$$ is called the information function. In a sense to be investigated later, $I(\hat{\theta})$, the observed information, indicates how much information about a parameter is available in a given experiment. The larger the value, the more curved is the log likelihood function and the easier it is to find the maximum.
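For a concrete case of these definitions (an added sketch using the exponential density $f_{\theta}(x)=\theta e^{-\theta x}$, chosen because everything has a closed form; the true rate 0.7 and the sample size are arbitrary), the score and information functions can be written out and the observed information checked to be positive at the maximum:

```python
import numpy as np

# X_1,...,X_n iid with density f(x) = theta * exp(-theta x):
#   l(theta) = n log(theta) - theta * sum(x)
#   S(theta) = n / theta - sum(x)            (score function)
#   I(theta) = n / theta^2                   (information function)
# The score equation S(theta) = 0 gives theta-hat = 1 / xbar.
rng = np.random.default_rng(6)
x = rng.exponential(scale=1 / 0.7, size=200)    # true theta = 0.7

n, s = len(x), x.sum()
theta_hat = n / s
score_at_hat = n / theta_hat - s                # = 0 by construction
obs_info = n / theta_hat**2                     # I(theta-hat) > 0: a maximum

print(theta_hat, score_at_hat, obs_info)
```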

Although we view the likelihood, log likelihood, score and information functions as functions of $\theta$, they are, of course, also functions of the observed data $x=(x_{1},\ldots,x_{n})$. When it is important to emphasize the dependence on the data $x$ we will write $L(\theta;x),$ $S(\theta;x),$ etc. Also when we wish to determine the sampling properties of these functions as functions of the random vector $X=(X_{1},\ldots,X_{n})$ we will write $L(\theta;X),$ $S(\theta;X),$ etc.

Definition

The Fisher or expected information (function) is the expected value of the information function, $J(\theta)=E_{\theta}[I(\theta;X)]$.

Likelihoods for Continuous Models

Suppose a random variable $X$ has a continuous probability density function $f_{\theta}(x)$ with parameter $\theta$. We will often observe only the value of $X$ rounded to some degree of precision (say 1 decimal place) in which case the actual observation is a discrete random variable. For example, suppose we observe $X$ correct to one decimal place. Then $$P_{\theta}[\text{observe }x]=\int_{x-0.05}^{x+0.05}f_{\theta}(u)du\approx(0.1)f_{\theta}(x),$$ assuming the function $f_{\theta}(x)$ is quite smooth over the interval. More generally, if we observe $X$ rounded to the nearest $\Delta$ (assumed small) then the likelihood of the observation is approximately $\Delta f_{\theta}(x)$. Since the precision $\Delta$ of the observation does not depend on the parameter, maximizing the discrete likelihood of the observation is essentially equivalent to maximizing the probability density function $f_{\theta}(x)$ over the parameter. This partially justifies the use of the probability density function in the continuous case as the likelihood function.

Similarly, if we observed $n$ independent values $x_{1},\ldots,x_{n}$ of a continuous random variable, we would maximize the likelihood $L(\theta)=\prod_{i=1}^{n}f_{\theta}(x_{i})$ (or more commonly its logarithm) to obtain the maximum likelihood estimator of $\theta$.

The relative likelihood function $R(\theta),$ defined as $R(\theta)=\frac{L(\theta)}{L(\hat{\theta})},$ is the ratio of the likelihood to its maximum value and takes on values between $0$ and $1$. It is used to rank possible parameter values according to their plausibility in light of the data. If $R(\theta_{1})=0.1$, say, then $\theta_{1}$ is rather an implausible parameter value because the data are ten times more likely when $\theta=\hat{\theta}$ than they are when $\theta=\theta_{1}$. The set of $\theta$-values for which $R(\theta)\geq p$ is called a $100p\%$ likelihood region for $\theta.$ When the parameter $\theta$ is one-dimensional, and $\theta_{0}$ is its true value, $-2\ln R(\theta_{0})=2[\ell(\hat{\theta})-\ell(\theta_{0})]$ converges in distribution as the sample size $n\rightarrow\infty$ to a chi-squared distribution with 1 degree of freedom. More generally, the number of degrees of freedom of the limiting chi-squared distribution is the dimension of the parameter $\theta.$ We can use this to construct a confidence interval for the unknown value of the parameter. For example if $b$ is chosen to be the 0.95 quantile of the chi-squared(1) distribution ($b=3.84)$, then $P[-2\ln R(\theta_{0})\leq3.84]\approx0.95,$ and since $-2\ln R(\theta)\leq3.84$ if and only if $R(\theta)\geq e^{-1.92}\approx0.15,$ a $15\%$ likelihood interval is an approximate $95\%$ confidence interval for $\theta$. This seems to indicate that the confidence interval tolerates a considerable difference in the likelihood. The likelihood at a parameter value must differ from the maximum likelihood by a factor of more than $6$ before it is excluded by a $95\%$ confidence interval or rejected by a test with level of significance 5%.
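The sketch below (an added numerical illustration for a Poisson sample, with arbitrary true value and sample size) computes the endpoints of the interval $\{\theta:-2\ln R(\theta)\leq3.84\}$, i.e. the $15\%$ likelihood interval interpreted as an approximate 95% confidence interval:

```python
import numpy as np
from scipy.optimize import brentq

# Poisson(theta) sample: l(theta) = s*log(theta) - n*theta + const, with
# s = sum(x), maximized at theta-hat = s/n.  Solve -2 log R(theta) = 3.84.
rng = np.random.default_rng(7)
x = rng.poisson(2.0, size=40)
n, s = len(x), x.sum()
theta_hat = s / n

def neg2logR(theta):
    # -2 [ l(theta) - l(theta-hat) ]; constants cancel in the ratio
    return -2 * (s * np.log(theta / theta_hat) - n * (theta - theta_hat))

lo = brentq(lambda t: neg2logR(t) - 3.84, 1e-6, theta_hat)
hi = brentq(lambda t: neg2logR(t) - 3.84, theta_hat, 10 * theta_hat)
print(f"MLE = {theta_hat:.3f}, approximate 95% CI = ({lo:.3f}, {hi:.3f})")
# Equivalently: the set where R(theta) >= exp(-1.92), roughly 0.15.
```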

Properties of the Score and Information

Consider a continuous model with a family of probability density functions $\{f_{\theta}(x);\ \theta\in\Omega\}$. Suppose all of the densities are supported on a common set $A$ not depending on $\theta$. Then $\int_{A}f_{\theta}(x)dx=1$ for all $\theta$ and therefore $$\int_{A}\frac{\partial}{\partial\theta}f_{\theta}(x)dx=\frac{\partial}{\partial\theta}\int_{A}f_{\theta}(x)dx=0,$$ provided that the integral can be interchanged with the derivative. Models that permit this interchange, and calculation of the Fisher information, are called regular models.

Regular Models

Consider a statistical model MATH with each density supported by a common set $A$. Suppose $\Omega$ is an open interval in the real line and $f_{\theta}(x)>0$ for all $\theta\in\Omega$ and $x\in A$. Suppose in addition

  1. $\ln[f_{\theta}(x)]$ is a continuous, three times differentiable function of $\theta$ for all $x\in A$.

  2. $\int_{A}\frac{\partial^{k}}{\partial\theta^{k}}f_{\theta}(x)dx=\frac{\partial^{k}}{\partial\theta^{k}}\int_{A}f_{\theta}(x)dx=0$ for $k=1,2$.

  3. $\left|\frac{\partial^{3}}{\partial\theta^{3}}\ln f_{\theta}(x)\right|\leq M(x)$ for some function $M(x)$ satisfying $E_{\theta}[M(X)]<\infty$.

  4. $0<E_{\theta}\left[\left(\frac{\partial}{\partial\theta}\ln f_{\theta}(X)\right)^{2}\right]<\infty$.

Then we call this a regular family of distributions or a regular model. Similarly, if these conditions hold with $X$ a discrete random variable and the integrals replaced by sums, the family is also called regular. Conditions like these permitting the interchange of expected values and derivatives are sometimes referred to as the Cramér conditions. In general, they are used to justify passage of a derivative under an integral.

Theorem B8

If $X=(X_{1},...,X_{n})$ is a random sample from a regular model $\{f_{\theta}(x);\ \theta\in\Omega\}$ then $E_{\theta}[S(\theta;X)]=0$ and $\mathrm{var}_{\theta}[S(\theta;X)]=E_{\theta}[I(\theta;X)]=J(\theta).$

The Multiparameter Case

The case of several parameters is exactly analogous to the scalar parameter case. Suppose $\theta=(\theta_{1},...,\theta_{k})$. In this case the ``parameter'' can be thought of as a column vector of $k$ scalar parameters. The score function $S(\theta)$ is a $k$-dimensional column vector whose $ith$ component is the derivative of $\ell(\theta)$ with respect to the $ith$ component of $\theta$, that is, $$S_{i}(\theta)=\frac{\partial}{\partial\theta_{i}}\ell(\theta).$$ The observed information function $I(\theta)$ is a $k\times k$ matrix whose $(i,j)$ element is $$I_{ij}(\theta)=-\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}\ell(\theta).$$ The Fisher information is a $k\times k$ matrix whose components are the component-wise expectations of the information matrix, that is, $J(\theta)=E_{\theta}[I(\theta;X)].$ The definition of a regular family of distributions is similarly extended. For a regular family of distributions $E_{\theta}[S(\theta;X)]=0$ and the covariance matrix of the score function is the Fisher information, i.e. $\mathrm{var}_{\theta}[S(\theta;X)]=J(\theta).$

Maximum Likelihood Estimation in the Exponential Family

Suppose $X$ has a regular exponential family distribution of the form $$f_{\eta}(x)=C(\eta)\exp\left\{\sum_{j=1}^{k}\eta_{j}T_{j}(x)\right\}h(x).$$ Then $$\frac{\partial}{\partial\eta_{j}}\ln f_{\eta}(x)=\frac{\partial}{\partial\eta_{j}}\ln C(\eta)+T_{j}(x)$$ and, since the score has zero expectation, $$E_{\eta}[T_{j}(X)]=-\frac{\partial}{\partial\eta_{j}}\ln C(\eta).$$ Therefore the maximum likelihood estimator of $\eta$ based on a random sample $(X_{1},...,X_{n})$ from $f_{\eta}(x)$ is the solution to the $k$ equations $$\frac{1}{n}\sum_{i=1}^{n}T_{j}(X_{i})=E_{\eta}[T_{j}(X)],\qquad j=1,...,k.$$ The maximum likelihood estimators are obtained by setting the sample moments of the natural sufficient statistic equal to their expected values and solving.

Finding Maximum Likelihood Estimates Using Newton's Method

Suppose that the maximum likelihood estimate $\hat{\theta}$ is determined by the likelihood equation $S(\hat{\theta})=0.$ It frequently happens that an analytic solution for $\hat{\theta}$ cannot be obtained. If we begin with an approximate value for the parameter, $\theta^{(0)}$, we may update that value as follows: $$\theta^{(i+1)}=\theta^{(i)}+\frac{S(\theta^{(i)})}{I(\theta^{(i)})},\qquad i=0,1,2,\ldots,$$ and provided that the sequence $\theta^{(i)}$ converges as $i\rightarrow\infty$, its limit is a solution of the score equation above. In the multiparameter case, where $S(\theta)$ is a vector and $I(\theta)$ is a matrix, Newton's method becomes: $$\theta^{(i+1)}=\theta^{(i)}+I^{-1}(\theta^{(i)})S(\theta^{(i)}).$$ In both of these, we can replace the information function by the Fisher information $J(\theta)$ for a similar algorithm.
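As an illustration (an added sketch, not an example from the text), Newton's method applies neatly to the gamma shape parameter with known scale, where the score involves the digamma function and the likelihood equation has no closed-form solution:

```python
import numpy as np
from scipy.special import digamma, polygamma

# For x_1,...,x_n iid Gamma(alpha, scale 1):
#   l(alpha) = (alpha - 1) sum(log x) - sum(x) - n log Gamma(alpha)
#   S(alpha) = sum(log x) - n digamma(alpha)
#   I(alpha) = n polygamma(1, alpha)   (trigamma; observed = Fisher here)
rng = np.random.default_rng(8)
x = rng.gamma(shape=3.0, scale=1.0, size=500)
n, slog = len(x), np.log(x).sum()

alpha = x.mean()                        # crude starting value theta^(0)
for i in range(20):
    score = slog - n * digamma(alpha)
    info = n * polygamma(1, alpha)
    step = score / info                 # theta^(i+1) = theta^(i) + S/I
    alpha += step
    if abs(step) < 1e-10:
        break

print(f"MLE of the shape after {i + 1} iterations: {alpha:.4f}")   # near 3.0
```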

Suppose we consider estimating a parameter $\tau(\theta),$ where $\theta$ is a scalar, using an unbiased estimator $T(X)$. Is there any limit to how well an estimator like this can behave? For unbiased estimators the answer is in the affirmative. A lower bound on the variance is given by the information inequality.

Information Inequality

Suppose $T(X)$ is an unbiased estimator of the parameter $\tau(\theta)$ in a regular statistical model $\{f_{\theta}(x);\ \theta\in\Omega\}$. Then $$\mathrm{var}_{\theta}[T(X)]\geq\frac{[\tau^{\prime}(\theta)]^{2}}{J(\theta)}.\qquad\text{(CRLB)}$$ Equality holds if and only if $f_{\theta}(x)$ is a regular exponential family with natural sufficient statistic $T(X)$.




If equality holds in (CRLB) then we call $T(X)$ an efficient estimator of $\tau(\theta)$. The number on the right hand side of (CRLB), $\frac{[\tau^{\prime}(\theta)]^{2}}{J(\theta)},$ is called the Cramér-Rao lower bound (C.R.L.B.). We often express the efficiency of an unbiased estimator using the ratio of the C.R.L.B. to the variance of the estimator. Large values of the efficiency (i.e. near one) indicate that the variance of the estimator is close to the lower bound.

The special case of the information inequality that is of most interest is the unbiased estimation of the parameter $\theta$ itself. The above inequality indicates that any unbiased estimator $T$ of $\theta$ has variance at least $1/J(\theta)$. The lower bound is achieved only when $f_{\theta}(x)$ is a regular exponential family with natural sufficient statistic $T$, so even in the exponential family, only certain parameters are such that we can find unbiased estimators which achieve the C.R.L.B., namely those that are expressible as the expected value of the natural sufficient statistic.

The Multiparameter Case

The right hand side in the information inequality generalizes naturally to the multiple parameter case in which $\theta$ is a vector. For example if $\theta=(\theta_{1},...,\theta_{k})$, then the Fisher information $J(\theta)$ is a $k\times k$ matrix. If $\tau(\theta)$ is any real-valued function of $\theta$ then its derivative is a column vector $\nabla\tau(\theta)=\left(\frac{\partial\tau}{\partial\theta_{1}},...,\frac{\partial\tau}{\partial\theta_{k}}\right)^{\prime}$. Then if $T(X)$ is any unbiased estimator of $\tau(\theta)$ in a regular model, $$\mathrm{var}_{\theta}[T(X)]\geq[\nabla\tau(\theta)]^{\prime}J^{-1}(\theta)\nabla\tau(\theta)$$ for all $\theta\in\Omega$.

Asymptotic Properties of Maximum Likelihood Estimators

One of the more successful attempts at justifying estimators and demonstrating some form of optimality has been through large sample theory or the asymptotic behaviour of estimators as the sample size $n\rightarrow\infty$. One of the first properties one requires is consistency of an estimator. This means that the estimator converges to the true value of the parameter as the sample size (and hence the information) approaches infinity.

Definition

Consider a sequence of estimators $T_{n}$ where the subscript $n$ indicates that the estimator has been obtained from data $(X_{1},...,X_{n})$ with sample size $n$. Then the sequence is said to be a consistent sequence of estimators of $\tau(\theta)$ if $T_{n}\rightarrow\tau(\theta)$ in probability, that is, if $\lim_{n\rightarrow\infty}P_{\theta}[|T_{n}-\tau(\theta)|>\epsilon]=0$ for every $\epsilon>0$ and all $\theta\in\Omega$.




It is worth a reminder at this point that probability density functions are used to produce probabilities and are only unique up to sets of probability zero. For example if two probability density functions $f(x)$ and $g(x)$ were such that they produced the same probabilities, or the same cumulative distribution function, that is, $\int_{-\infty}^{x}f(u)du=\int_{-\infty}^{x}g(u)du$ for all $x,$ then we would not consider them distinct probability densities, even though $f(x)$ and $g(x)$ may differ at one or more values of $x$. Now when we parameterize a given statistical model using $\theta$ as the parameter, it is natural to do so in such a way that different values of the parameter lead to distinct probability density functions. This means, for example, that the cumulative distribution functions associated with these densities are distinct. Without this assumption, made in the following theorem, it would be impossible to accurately estimate the parameter since two different parameters could lead to the same cumulative distribution function and hence exactly the same behaviour of the observations.

Theorem B9

Suppose $(X_{1},...,X_{n})$ is a random sample from a regular statistical model $\{f_{\theta}(x);\ \theta\in\Omega\}$. Assume the densities corresponding to different values of the parameter are distinct. Let $\ell(\theta)=\sum_{i=1}^{n}\ln f_{\theta}(X_{i})$. Then with probability tending to $1$ as $n\rightarrow\infty$, the likelihood equation $$S(\theta)=\frac{\partial}{\partial\theta}\ell(\theta)=0$$ has a root $\hat{\theta}_{n}$ such that $\hat{\theta}_{n}$ converges in probability to $\theta_{0}$, the true value of the parameter, as $n\rightarrow\infty$.




The likelihood equation above does not always have a unique root. The consistency of the maximum likelihood estimator is one indication that it performs reasonably well. However, it provides no reason to prefer it to some other consistent estimator. The following result indicates that maximum likelihood estimators perform as well as any reasonable estimator can, at least in the limit as $n\rightarrow\infty$. Most of the proofs of these asymptotic results can be found in Lehmann (1991).

Theorem B10

Suppose $(X_{1},...,X_{n})$ is a random sample from a regular statistical model $\{f_{\theta}(x);$ $\theta\in\Omega\}$. Suppose $\hat{\theta}_{n}$ is a consistent root of the likelihood equation as in the theorem above. Let $J_{1}(\theta)$ denote the Fisher information for a sample of size one. Then $$\sqrt{n}(\hat{\theta}_{n}-\theta_{0})\rightarrow N\left(0,\frac{1}{J_{1}(\theta_{0})}\right)\quad\text{in distribution,}$$ where $\theta_{0}$ is the true value of the parameter.




This result may also be written as saying that $\hat{\theta}_{n}$ is approximately $N\left(\theta_{0},\frac{1}{nJ_{1}(\theta_{0})}\right)$ for large $n$.


This theorem asserts that, at least under the regularity required, the maximum likelihood estimator is asymptotically unbiased. Moreover, the asymptotic variance of the maximum likelihood estimator approaches the Cramér-Rao lower bound for unbiased estimators. This justifies the comparison of the variance of an estimator $T_{n}$ based on a sample of size $n$ to the value $\frac{1}{nJ_{1}(\theta)}$, which is the asymptotic variance of the maximum likelihood estimator and also the Cramér-Rao lower bound.

It also follows, by the delta method, that $$\sqrt{n}\,[\tau(\hat{\theta}_{n})-\tau(\theta_{0})]\rightarrow N\left(0,\frac{[\tau^{\prime}(\theta_{0})]^{2}}{J_{1}(\theta_{0})}\right)\quad\text{in distribution.}$$ This indicates that the asymptotic variance of a smooth function $\tau(\hat{\theta}_{n})$ of the maximum likelihood estimator also achieves the Cramér-Rao lower bound.

Definition

Suppose $T_{n}$ is asymptotically normal with mean $\theta_{0}$ and variance $\sigma_{T}^{2}/n$. The asymptotic efficiency of $T_{n}$ is defined to be $e(T_{n})=\frac{1}{J_{1}(\theta_{0})\sigma_{T}^{2}}$. This is the ratio of the Cramér-Rao lower bound to the variance of $T_{n}$ and is typically less than one, with values close to one indicating that $T_{n}$ is asymptotically nearly as efficient as the maximum likelihood estimator.

The Multiparameter Case

In the case $\theta=(\theta_{1},...,\theta_{k})$, the score function is the vector of partial derivatives of the log likelihood with respect to the components of $\theta$. Therefore the likelihood equation is $k$ equations in the $k$ unknown parameters. Under similar regularity conditions to the univariate case, the conclusion of Theorem B9 holds in this case, that is, the components of $\hat{\theta}_{n}$ each converge in probability to the corresponding component of $\theta_{0}$. Similarly, the asymptotic normality remains valid in this case with little modification. Let $J_{1}(\theta)$ be the Fisher information matrix for a sample of size one and assume it is a non-singular matrix. Then $$\sqrt{n}(\hat{\theta}_{n}-\theta_{0})\rightarrow MVN(0,J_{1}^{-1}(\theta_{0}))\quad\text{in distribution,}$$ where the multivariate normal distribution with $k$-dimensional mean vector $\mu$ and covariance matrix $B$ ($k\times k$), denoted $MVN(\mu,B)$, has probability density function defined on $\mathcal{R}^{k}$, $$f(x)=\frac{1}{(2\pi)^{k/2}|B|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^{\prime}B^{-1}(x-\mu)\right\}.$$

It also follows that $$\sqrt{n}\,[\tau(\hat{\theta}_{n})-\tau(\theta_{0})]\rightarrow N\left(0,[\nabla\tau(\theta_{0})]^{\prime}J_{1}^{-1}(\theta_{0})\nabla\tau(\theta_{0})\right)\quad\text{in distribution,}$$ where $\nabla\tau(\theta)$ is the column vector of partial derivatives of $\tau$ with respect to the components of $\theta$. Once again the asymptotic variance-covariance matrix is identical to the lower bound given by the multiparameter case of the Information Inequality.




Joint confidence regions can be constructed based on one of the asymptotic results $$n(\hat{\theta}_{n}-\theta)^{\prime}J_{1}(\theta)(\hat{\theta}_{n}-\theta)\sim\chi_{k}^{2},\qquad n(\hat{\theta}_{n}-\theta)^{\prime}J_{1}(\hat{\theta}_{n})(\hat{\theta}_{n}-\theta)\sim\chi_{k}^{2},$$ or $$2[\ell(\hat{\theta}_{n})-\ell(\theta)]\sim\chi_{k}^{2}\quad\text{(all approximately, for large }n\text{).}$$ Confidence intervals for a single parameter, say $\theta_{i}$, can be based on the approximate normality of $(\hat{\theta}_{n})_{i}$ with mean $\theta_{i}$ and variance $\frac{1}{n}[J_{1}^{-1}(\hat{\theta}_{n})]_{ii},$ where $(a)_{i}$ is the $ith$ entry in the vector $a$ and $[A^{-1}]_{ii}$ is the $(i,i)$ entry in the matrix $A^{-1}$.

Unidentifiability and Singular Information Matrices

Suppose we observe two independent random variables $Y_{1},Y_{2}$ having normal distributions with the same variance $\sigma^{2}$ and means $\theta_{1}+\theta_{2}$ and $\theta_{2}+\theta_{3}$ respectively. In this case, although the means depend on the parameter $\theta=(\theta_{1},\theta_{2},\theta_{3})$, the value of this vector parameter is unidentifiable in the sense that, for some pairs of distinct parameter values, the probability density functions of the observations are identical. For example the parameter $(1,0,1)$ leads to exactly the same joint distribution of $Y_{1},Y_{2}$ as does the parameter $(0,1,0)$. In this case, we might consider only the two parameters $\phi_{1}=\theta_{1}+\theta_{2}$ and $\phi_{2}=\theta_{2}+\theta_{3},$ and anything derivable from this pair estimable, while parameters such as $\theta_{2}$ that cannot be obtained as functions of $\phi_{1},\phi_{2}$ are consequently unidentifiable. The solution to the original identifiability problem is the reparametrization to the new parameter $(\phi_{1},\phi_{2})$ in this case, and in general, unidentifiability usually means one should seek a new, more parsimonious parametrization.

In the above example, compute the Fisher information matrix for the parameter $\theta=(\theta_{1},\theta_{2},\theta_{3})$. The log likelihood is $$\ell(\theta)=\text{const}-\frac{1}{2\sigma^{2}}\left[(Y_{1}-\theta_{1}-\theta_{2})^{2}+(Y_{2}-\theta_{2}-\theta_{3})^{2}\right]$$ and the Fisher information, the covariance matrix of the score vector, is $$J(\theta)=\frac{1}{\sigma^{2}}\left(\begin{array}{ccc}1&1&0\\1&2&1\\0&1&1\end{array}\right).$$ Notice that the Fisher information matrix is singular. This means that if you were to attempt to compute the asymptotic variance of the maximum likelihood estimator of $\theta$ by inverting the Fisher information matrix, the inversion would be impossible. Attempting to invert a singular matrix is like attempting to invert the number 0. It results in one or more components that you can consider to be infinite. Arguing intuitively, the asymptotic variance of the maximum likelihood estimator of some of the parameters is infinite. This is an indication that asymptotically, at least, some of the parameters may not be identifiable. When parameters are unidentifiable, the Fisher information matrix is generally singular. Conversely, when $J(\theta)$ is singular for all values of $\theta$, this may or may not mean parameters are unidentifiable for finite sample sizes, but it does usually mean one should take a careful look at the parameters with a possible view to adopting another parametrization.
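The singularity can be confirmed directly; the sketch below (assuming the means $\theta_{1}+\theta_{2}$ and $\theta_{2}+\theta_{3}$ used in the reconstruction above, with $\sigma^{2}=1$) builds the information matrix and checks its rank:

```python
import numpy as np

# Fisher information for theta = (theta1, theta2, theta3) when
# Y1 ~ N(theta1 + theta2, 1) and Y2 ~ N(theta2 + theta3, 1), independent.
# The score is A'(y - A theta) with the design matrix A below, so J = A'A.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
J = A.T @ A

print(J)                            # [[1,1,0],[1,2,1],[0,1,1]]
print(np.linalg.matrix_rank(J))     # 2 < 3: J is singular
print(np.linalg.det(J))             # 0 (up to rounding)
# np.linalg.inv(J) would raise LinAlgError: "infinite" asymptotic variance.
```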

U.M.V.U.E.'s and Maximum Likelihood Estimators: A Comparison

Which of the two main types of estimators should we use? There is no general consensus among statisticians.

  1. If we are estimating the expectation of a natural sufficient statistic $T_{i}(X)$ in a regular exponential family, both maximum likelihood and unbiasedness considerations lead to the use of $T_{i}$ as an estimator.

  2. When sample sizes are large, U.M.V.U.E.'s and maximum likelihood estimators are essentially the same. In that case use is governed by ease of computation. Unfortunately how large ``large'' needs to be is usually unknown. Some studies have been carried out comparing the behaviour of U.M.V.U.E.'s and maximum likelihood estimators for various small fixed sample sizes. The results are, as might be expected, inconclusive.

  3. Maximum likelihood estimators exist ``more frequently'' and when they do they are usually easier to compute than U.M.V.U.E.'s. This is essentially because of the appealing invariance property of maximum likelihood estimators: the maximum likelihood estimator of $\tau(\theta)$ is $\tau(\hat{\theta})$.

  4. Simple examples are known for which maximum likelihood estimators behave badly even for large samples. This is more often the case when there is a large number of parameters, some of which, termed ``nuisance parameters'' are of no direct interest, but complicate the estimation.

  5. U.M.V.U.E.'s and maximum likelihood estimators are not necessarily robust. A small change in the underlying distribution or the data could result in a large change in the estimator.

Other Estimation Criteria

Best Linear Unbiased Estimators

The problem of finding best unbiased estimators is considerably simpler if we limit the class in which we search. If we permit any function of the data, then we usually require the heavy machinery of complete sufficiency to produce U.M.V.U.E.'s. However, the situation is much simpler if we suggest some initial random variables and then require that our estimator be a linear combination of these. Suppose, for example, we have random variables $Y_{1},Y_{2},Y_{3}$ with $E(Y_{1})=\alpha,$ $E(Y_{2})=\alpha+\theta,$ $E(Y_{3})=\theta,$ where $\theta$ is the parameter of interest and $\alpha$ is another parameter. What linear combinations of the $Y_{i}$'s provide an unbiased estimator of $\theta$, and among these possible linear combinations which one has the smallest possible variance? To answer these questions, we need to know the covariances $Cov(Y_{i},Y_{j})$ (at least up to some scalar multiple). Suppose $Cov(Y_{i},Y_{j})=0,$ $\ i\neq j$ and $\mathrm{var}(Y_{i})=\sigma^{2}$. Let $Y=(Y_{1},Y_{2},Y_{3})^{\prime}$ and $\beta=(\alpha,\theta)^{\prime}.$ We can write the model in a form reminiscent of linear regression as $$Y=X\beta+\epsilon,\qquad X=\left(\begin{array}{cc}1&0\\1&1\\0&1\end{array}\right),$$ where the $\epsilon_{i}$'s are uncorrelated random variables with $E(\epsilon_{i})=0$ and $\mathrm{var}(\epsilon_{i})=\sigma^{2}$. Then the linear combination of the components of $Y$ that has the smallest variance among all unbiased estimators of $\beta$ is given by the usual regression formula $\hat{\beta}=(X^{\prime}X)^{-1}X^{\prime}Y,$ and $\hat{\theta}=(0,1)\hat{\beta}$ provides the best estimator of $\theta$ in the sense of smallest variance. In other words, the linear combination of the components of $Y$ which has smallest variance among all unbiased estimators of $a^{\prime}\beta$ is $a^{\prime}(X^{\prime}X)^{-1}X^{\prime}Y$ where $a^{\prime}=(0,1)$.

More generally, we wish to consider a number $n$ of possibly dependent random variables $Y_{i}$ whose expectations may be related to a parameter $\theta$. These may, for example, be individual observations or a number of competing estimators constructed from these observations. We assume $Y=(Y_{1},...,Y_{n})^{\prime}$ has expectation given by $E(Y)=X\beta$ where $X$ is some $n\times k$ matrix having rank $k$ and $\beta=(\beta_{1},...,\beta_{k})^{\prime}$ is a vector of unknown parameters. As in multiple regression, the matrix $X$ is known and non-random. Suppose the covariance matrix of $Y$ is $\sigma^{2}B$ with $B$ a known non-singular matrix and $\sigma^{2}$ a possibly unknown scalar parameter. We wish to estimate a linear combination of the components of $\beta$, say $\theta=a^{\prime}\beta,$ where $a$ is a known $k$-dimensional column vector. We restrict our attention to unbiased estimators of $\theta$.

Gauss-Markov Theorem B11

Suppose $Y$ is a random vector with mean and covariance matrix $$E(Y)=X\beta,\qquad \mathrm{var}(Y)=\sigma^{2}B,$$ where the matrices $X$ and $B$ are known and the parameters $\beta$ and $\sigma^{2}$ unknown. Suppose we wish to estimate a linear combination $\theta=a^{\prime}\beta$ of the components of $\beta$. Then among all linear combinations of the components of $Y$ which are unbiased estimators of the parameter $\theta,$ the estimator $$\hat{\theta}=a^{\prime}(X^{\prime}B^{-1}X)^{-1}X^{\prime}B^{-1}Y$$ has the smallest variance.

Note that this result does not depend on any assumed normality of the components of $Y$ but only on the first and second moment behaviour, that is, the mean and the covariances. The special case when $B$ is the identity matrix is the least squares estimator.
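A small numerical sketch follows (with made-up observations, and using the three-variable example above with $E(Y_{1})=\alpha$, $E(Y_{2})=\alpha+\theta$, $E(Y_{3})=\theta$ and $B=I$); it simply evaluates the formula of the theorem:

```python
import numpy as np

# Best linear unbiased estimator of theta = a'beta with a' = (0, 1) in the
# model E(Y) = X beta, var(Y) = sigma^2 B (here B = I, so this is least squares).
X = np.array([[1.0, 0.0],     # E(Y1) = alpha
              [1.0, 1.0],     # E(Y2) = alpha + theta
              [0.0, 1.0]])    # E(Y3) = theta
B = np.eye(3)
y = np.array([1.1, 3.2, 2.0])            # illustrative observations

Binv = np.linalg.inv(B)
beta_hat = np.linalg.solve(X.T @ Binv @ X, X.T @ Binv @ y)
a = np.array([0.0, 1.0])
theta_hat = a @ beta_hat

print(beta_hat)      # (alpha-hat, theta-hat)
print(theta_hat)
```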

Estimating Equations

To find the maximum likelihood estimator, we usually solve the likelihood equation $S(\theta;X)=0.$ Note that the function on the left hand side is a function of both the observations and the parameter. Such a function is called an estimating function. Most sensible estimators, like the maximum likelihood estimator, can be described easily through an estimating function. For example, if we know $\mathrm{var}(X_{i})=\theta$ for independent identically distributed $X_{i}$, then we can use the estimating function $$\psi(\theta;X)=\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}-(n-1)\theta$$ to estimate the parameter $\theta$, without any other knowledge of the distribution, its density, mean etc. The estimating function is set equal to 0 and solved for $\theta$. The above estimating function is an unbiased estimating function in the sense that $$E_{\theta}[\psi(\theta;X)]=0\quad\text{for all}\ \theta\in\Omega.$$ This allows us to conclude that the function is at least centered appropriately for the estimation of the parameter $\theta$. Now suppose that $\psi$ is an unbiased estimating function corresponding to a large sample. Often it can be written as the sum of independent components, for example $$\psi(\theta;X)=\sum_{i=1}^{n}\psi_{i}(\theta;X_{i}).\qquad\text{(B3.5)}$$ Now suppose $\hat{\theta}$ is a root of the estimating equation $\psi(\hat{\theta};X)=0.$ Then for $\theta$ sufficiently close to $\hat{\theta}$, $$\psi(\theta;X)\approx(\theta-\hat{\theta})\frac{\partial}{\partial\theta}\psi(\theta;X).\qquad\text{(ef2)}$$ Using the Central Limit Theorem, assuming that $\theta$ is the true value of the parameter and provided $\psi$ is a sum as in (B3.5), the left hand side of (ef2) is approximately normal with mean $0$ and variance equal to $\mathrm{var}_{\theta}[\psi(\theta;X)]$. The term $\frac{\partial}{\partial\theta}\psi(\theta;X)$ is also a sum of similar derivatives of the individual $\psi_{i}$. If a law of large numbers applies to these terms, then when divided by $n$ this sum will be asymptotically equivalent to $E_{\theta}[\frac{\partial}{\partial\theta}\psi_{1}(\theta;X_{1})]$. It follows that the root $\hat{\theta}$ will have an approximate normal distribution with mean $\theta$ and variance $$\frac{\mathrm{var}_{\theta}[\psi(\theta;X)]}{\left(E_{\theta}\left[\frac{\partial}{\partial\theta}\psi(\theta;X)\right]\right)^{2}}.$$ By analogy with the relation between the asymptotic variance of the maximum likelihood estimator and the Fisher information, we call the reciprocal of the above asymptotic variance formula the Godambe information of the estimating function. This information measure is $$\frac{\left(E_{\theta}\left[\frac{\partial}{\partial\theta}\psi(\theta;X)\right]\right)^{2}}{\mathrm{var}_{\theta}[\psi(\theta;X)]}.\qquad\text{(B3.6)}$$ Godambe (1960) proved the following result.
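Continuing the variance example (a sketch built on the estimating function reconstructed above; the exponential data are an arbitrary choice, made to stress that no density needs to be specified), the root of the estimating equation can be computed and its centering checked:

```python
import numpy as np

# Unbiased estimating function for theta = var(X_i):
#   psi(theta; x) = sum((x_i - xbar)^2) - (n - 1) theta,  E_theta[psi] = 0.
# Its root is the sample variance; no model for the density is needed.
rng = np.random.default_rng(9)
n, reps = 30, 100_000

X = rng.exponential(scale=2.0, size=(reps, n))    # any law with var = 4

theta_hat = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / (n - 1)
print(theta_hat.mean())    # near 4: the root is centered at the true variance

# Godambe asymptotic variance: var(psi) / (E[d psi / d theta])^2; here
# d psi / d theta = -(n - 1), so the formula reduces to the variance of the root.
print(theta_hat.var())
```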

Theorem B12

Among all unbiased estimating functions satisfying the usual regularity conditions, an estimating function which maximizes the Godambe information (B3.6) is of the form $\psi(\theta;X)=c(\theta)S(\theta;X)$, a multiple of the score function, where $c(\theta)$ is non-random.

Bayesian Methods

There are two major schools of thought on the way in which statistical inference is conducted, the frequentist and the Bayesian school. Typically, these schools differ slightly on the actual methodology and the conclusions that are reached, but more substantially on the philosophy underlying the treatment of parameters. So far we have considered a parameter as an unknown constant underlying or indexing the probability density function of the data. It is only the data, and statistics derived from the data that are random.

The Bayesian begins with the assertion that the parameter $\theta$ is obtained as the realization of some larger random experiment. The parameter is assumed to have been generated according to some distribution, the prior distribution $\pi$, and the observations then obtained from the corresponding probability density function $f_{\theta}$ interpreted as the conditional probability density of the data given the value of $\theta$. The prior distribution $\pi(\theta)$ quantifies information about $\theta$ prior to any further data being gathered. Sometimes $\pi(\theta)$ can be constructed on the basis of past data. For example, if a quality inspection program has been running for some time, the distribution of the number of defectives in past batches can be used as the prior distribution for the number of defectives in a future batch. The prior can also be chosen to incorporate subjective information based on an expert's experience and personal judgement. The purpose of the data is then to adjust this distribution for $\theta$ in the light of the data, to result in the posterior distribution for the parameter. Any conclusions about the plausible value of the parameter are to be drawn from the posterior distribution. For a frequentist, statements like $P(1<\theta<2)$ are meaningless; all randomness lies in the data and the parameter is an unknown constant. Frequentists are careful to assure students that if an observed 95% confidence interval for the parameter is $1<\theta<2$ this does not imply $P(1<\theta<2)=0.95$. However, a Bayesian will happily quote such a probability, usually conditionally on some observations, for example, $P(1<\theta<2|X=x)=0.95$. In spite of some distance in the philosophy regarding the (random?) nature of statistical parameters, the two paradigms tend to largely agree for large sample sizes because the prior assumptions of the Bayesian tend to be a small contributor to the conclusion.

Posterior Distributions

Suppose the parameter is initially chosen at random according to the prior distribution $\pi(\theta)$ and then, given the value of the parameter, the observations are independent identically distributed, each with conditional probability (density) function $f_{\theta}(x)$. Then the posterior distribution of the parameter is the conditional distribution of $\theta$ given the data $X=x$: $$\pi(\theta|x)=\frac{f_{\theta}(x)\pi(\theta)}{m(x)}=\frac{L(\theta)\pi(\theta)}{m(x)},$$ where $m(x)=\int f_{\theta}(x)\pi(\theta)d\theta$ is independent of $\theta$ and $L(\theta)$ is the likelihood function. Since Bayesian inference is based on the posterior distribution, it depends on the data only through the likelihood function.

Example

Suppose a coin is tossed $n$ times with probability of heads $\theta$. It is known from my ``very considerable previous experience with coins'' that the probability of heads is not always identically $1/2$ but follows a Beta$(10,10)$ distribution. If the $n$ tosses result in $x$ heads, we wish to find the posterior density function for $\theta$. In this case the prior distribution for the parameter $\theta$ is the Beta$(10,10)$ distribution with probability density function $$\pi(\theta)=\frac{\Gamma(20)}{\Gamma(10)\Gamma(10)}\theta^{9}(1-\theta)^{9},\qquad0<\theta<1.$$ The posterior distribution of $\theta$ is therefore proportional to $$C\,\theta^{9+x}(1-\theta)^{9+n-x},$$ where the constant $C$ may depend on $x$ but does not depend on $\theta.$ Therefore the posterior distribution is also a Beta distribution but with parameters $(10+x,10+n-x).$ Notice that the posterior mean is the expected value of this beta distribution, $$\frac{10+x}{20+n},$$ which, for $n$ and $x$ sufficiently large, is reasonably close to the usual estimator $x/n.$
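The posterior update in this example is a one-line computation; a sketch with illustrative counts ($n=50$ tosses and $x=32$ heads are invented numbers) follows:

```python
from scipy import stats

# Beta(10, 10) prior on the probability of heads; x heads in n tosses
# give a Beta(10 + x, 10 + n - x) posterior.
a0, b0 = 10, 10
n, x = 50, 32

posterior = stats.beta(a0 + x, b0 + n - x)

print(posterior.mean())    # (10 + x) / (20 + n) = 42/70 = 0.6
print(x / n)               # the usual estimator, 0.64: the prior shrinks it
```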

Conjugate Prior Distributions

If a prior distribution has the property that the posterior distribution is in the same family of distributions as the prior then the prior is called a conjugate prior.

Suppose $(X_{1},...,X_{n})$ is a random sample from the exponential family $$f_{\theta}(x)=C(\theta)\exp\left\{\sum_{j=1}^{k}q_{j}(\theta)T_{j}(x)\right\}h(x)$$ and $\theta$ is assumed to have the prior distribution with parameters $a,b=(b_{1},...,b_{k})$ given by $$\pi(\theta)\propto C(\theta)^{a}\exp\left\{\sum_{j=1}^{k}q_{j}(\theta)b_{j}\right\}.\qquad\text{(3.8)}$$ Then the posterior distribution of $\theta$, given the data $(X_{1},...,X_{n})=(x_{1},...,x_{n}),$ is easily seen to be given by $$\pi(\theta|x)\propto C(\theta)^{a+n}\exp\left\{\sum_{j=1}^{k}q_{j}(\theta)\left(b_{j}+\sum_{i=1}^{n}T_{j}(x_{i})\right)\right\}.$$ Notice that the posterior distribution is in the same family of distributions as (3.8) and thus $\pi(\theta)$ is a conjugate prior. The values of the parameters of the posterior distribution reflect the choice of parameters in the prior.

Example

To find the conjugate prior for $\theta=(\alpha,\beta)$ for a random sample $(X_{1},...,X_{n})$ from the beta$(\alpha,\beta)$ distribution with probability density function $$f(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1},\qquad0<x<1,$$ we begin by writing this in exponential family form, $$f(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\exp\{\alpha\ln x+\beta\ln(1-x)\}\frac{1}{x(1-x)}.$$ Then the conjugate prior distribution is the joint probability density function $\pi(\alpha,\beta)$ on $(\alpha,\beta)$ which is proportional to $$\left[\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right]^{a}\exp\{\alpha b_{1}+\beta b_{2}\}\qquad\text{(priorbeta)}$$ for parameters $a,b_{1},b_{2}.$ The posterior distribution takes the same form as (priorbeta) but with the parameters $a,b_{1},b_{2}$ replaced by $$a+n,\qquad b_{1}+\sum_{i=1}^{n}\ln x_{i},\qquad b_{2}+\sum_{i=1}^{n}\ln(1-x_{i}).$$ Bayesians are sometimes criticised for allowing their subjective opinions (in this case the choice of the prior parameters $a,b_{1},b_{2}$) to influence the resulting inference, but notice that in this case, and more generally, as the sample size $n$ grows, the values of the parameters of the posterior distribution are mostly determined by the components $\sum_{i=1}^{n}\ln x_{i}$ and $\sum_{i=1}^{n}\ln(1-x_{i})$ above, which grow with $n,$ eventually washing out the influence of the choice of prior parameters.

Noninformative Prior Distributions

The choice of the prior distribution to be the conjugate prior is often motivated by mathematical convenience. However, a Bayesian would also like the prior to accurately represent the preliminary uncertainty about the plausible values of the parameter, and this may not be easily translated into one of the conjugate prior distributions. Noninformative priors are the usual way of representing ignorance about $\theta$ and they are frequently used in practice. It can be argued that they are more objective than a subjectively assessed prior distribution since the latter may contain personal bias as well as background knowledge. Also, in some applications the amount of prior information available is far less than the information contained in the data. In this case there seems little point in worrying about a precise specification of the prior distribution.

In the coin tossing example above, we assumed a Beta$(10,10)$ prior distribution for the probability of heads. If there were no reason to prefer one value of $\theta$ over any other, then a noninformative or `flat' prior distribution for $\theta$ that could be used is the UNIF$(0,1)$ distribution, which, as it turns out, is also a special case of the beta distribution. Ignorance may not be bliss but for Bayesians it is most often uniformly distributed. For estimating the mean $\theta$ of a N$(\theta,1)$ distribution the possible values for $\theta$ are $(-\infty,\infty)$. If we take the prior distribution to be uniform on $(-\infty,\infty)$, that is, $\pi(\theta)=c>0$ for all $\theta,$ then this is not a proper probability density since $\int_{-\infty}^{\infty}\pi(\theta)d\theta=\infty.$ Prior densities of this type are called improper priors. In this case we could consider a sequence of prior distributions such as the UNIF$(-M,M)$ which approximates this prior as $M\rightarrow\infty$. Suppose we call such a prior density function $\pi_{M}$. Then the posterior distribution of the parameter is given by $$\pi_{M}(\theta|x)=\frac{L(\theta)\pi_{M}(\theta)}{\int L(u)\pi_{M}(u)du}$$ and it is easy to see that as $M\rightarrow\infty$, this approaches a constant multiple of the likelihood function $L(\theta)$. For reasonably large sample size, $L(\theta)$ is often integrable and can therefore be normalized to produce a proper posterior distribution, even though the corresponding prior was improper. This Bayesian development provides an alternate interpretation of the likelihood function: we can consider it as proportional to the posterior distribution of the parameter when using a uniform improper prior on the whole real line. The language is somewhat sloppy here since, as we have seen, the uniform distribution on the whole real line really makes sense only through taking limits of uniform distributions on finite intervals.

In the case of a scale parameter, which must take positive values (such as the normal variance), it is usual to express ignorance by assuming that the logarithm of the parameter is uniform on the real line; equivalently, the improper prior density of the parameter itself is taken proportional to the reciprocal of the parameter.

One possible difficulty with using noninformative prior distributions is whether the prior distribution should be uniform for $\theta$ itself or for some function of $\theta$, such as $\theta^{2}$ or $\log(\theta).$ The objective when we used a uniform prior for a probability was to add no more information about the parameter around one possible value than around some other, and so it makes sense to use a uniform prior for a parameter that essentially has uniform information attached to it. For this reason, it is common to use a uniform prior for $\tau=h(\theta)$ where $h(\theta)$ is the function of $\theta$ whose Fisher information, $J^{\ast}(\tau),$ is constant. This idea is due to Jeffreys and leads to a prior distribution which is proportional to $[J(\theta)]^{1/2}.$ Such a prior is referred to as a Jeffreys' prior. The reparametrization which leads to a Jeffreys' prior can be carried out as follows: suppose $\{f_{\theta}(x);$ $\theta \in\Omega\}$ is a regular model and MATH is the Fisher information for a single observation. Then if we choose an arbitrary value for $\theta_{0}$ and define the reparameterization MATH then the Fisher information for the parameter $\tau$, MATH equals one for all values of $\tau$, and so Jeffreys' prior corresponds to using a uniform prior distribution on the values of $\tau.$ Since the asymptotic variance of the maximum likelihood estimator $\hat{\tau}_{n}$ is equal to $1/n$, which does not depend on $\tau ,$ the reparameterization (3.9) is often called a variance stabilizing transformation.
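As a standard worked instance (ours, not the text's): for a Bernoulli$(\theta)$ observation the Fisher information is $J(\theta)=1/[\theta(1-\theta)]$, so Jeffreys' prior is $$\pi(\theta)\propto\lbrack J(\theta)]^{1/2}=\theta^{-1/2}(1-\theta)^{-1/2},$$ the Beta$(\frac{1}{2},\frac{1}{2})$ distribution, and the corresponding reparameterization $$\tau=\int_{0}^{\theta}\sqrt{J(u)}\,du=\int_{0}^{\theta}\frac{du}{\sqrt{u(1-u)}}=2\arcsin\sqrt{\theta}$$ is the familiar arcsine variance stabilizing transformation, with $J^{\ast}(\tau)=1$ for all $\tau$.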

Bayes Point Estimators

One method of obtaining a point estimator of $\theta$ is to use the posterior distribution and a suitable loss function.

Theorem B13

The Bayes estimator of $\theta$ for squared error loss with respect to the prior $\pi(\theta)$ given data $X$ is the mean of the posterior distribution given by MATH This estimator minimizes MATH
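To see why the posterior mean is the minimizer (a one-line verification using the decomposition of the mean squared error, applied conditionally on the data): for any candidate value $a=a(x)$, $$E[(\theta-a)^{2}\mid X=x]=\mathrm{Var}(\theta\mid x)+\left(a-E(\theta\mid x)\right)^{2},$$ and the right-hand side is smallest when $a=E(\theta\mid x)$.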

Example

Suppose MATH is a random sample from the distribution with probability density function MATH Using a conjugate prior for $\theta$ find the Bayes estimator of $\theta$ for squared error loss.

We begin by identifying the conjugate prior distribution. Since MATH the conjugate prior density is MATH which is evidently a Gamma distribution restricted to the interval $(1,\infty)$, and if the prior is to be proper, the parameters must be chosen such that MATH so $b\leq0.$ Then the posterior distribution takes the same form as the prior but with $a$ replaced by $a+n$ and $b$ by MATH The Bayes estimator of $\theta$ for squared error loss is the mean of this posterior distribution, or MATH

Bayesian Interval Estimates

There remains, after many decades, a controversy between Bayesians and frequentists about which approach to estimation is more suitable to the real world. The Bayesian has advantages at least in the ease of interpretation of the results. For example, a Bayesian can use the posterior distribution given the data MATH to determine points $c_{1}=c_{1}(x),$ $c_{2}=c_{2}(x)$ such that MATH and then give a Bayesian confidence interval $(c_{1},c_{2})$ for the parameter. If this results in the interval $(2,5)$ the Bayesian will state that (in a Bayesian model, subject to the validity of the prior) the conditional probability given the data that the parameter falls in the interval $(2,5)$ is $0.95$. No such probability can be ascribed to a confidence interval for frequentists, who see no randomness in the parameter to which this probability statement is supposed to apply. Bayesian confidence regions are also called credible regions in order to make clear the distinction between the interpretation of Bayesian confidence regions and frequentist confidence regions.

Suppose $\pi(\theta|x)$ is the posterior distribution of $\theta$ given the data MATH and $A$ is a subset of $\Omega$. If MATH then $A$ is called a $p$ credible region for $\theta.$ A credible region can be formed in many ways. If $(a,b)$ is an interval such that MATH then $(a,b)$ is called a $p$ equal-tailed credible region. A highest posterior density (H.P.D.) credible region is constructed in a manner similar to likelihood regions. The $p$ highest posterior density credible region is given by MATH where $c$ is chosen such that MATH A highest posterior density credible region is optimal in the sense that it is the shortest credible region for a given value of $p$.
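A grid-based construction of an H.P.D. region is sketched below (one of many ways such a region can be computed; the Beta$(12,8)$ posterior is our assumed example, and for a unimodal posterior the resulting set is an interval):

import numpy as np
from scipy import stats

def hpd_interval(posterior_pdf, grid, p=0.95):
    # Keep the grid points of highest posterior density until they carry
    # probability p; the density at the last point kept plays the role of c.
    dens = posterior_pdf(grid)
    dx = grid[1] - grid[0]
    order = np.argsort(dens)[::-1]                 # highest density first
    mass = np.cumsum(dens[order]) * dx
    keep = order[: np.searchsorted(mass, p) + 1]
    return grid[keep].min(), grid[keep].max()      # an interval if unimodal

grid = np.linspace(1e-6, 1 - 1e-6, 100000)
print(hpd_interval(stats.beta(12, 8).pdf, grid, p=0.95))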

Example

Suppose MATH is a random sample from the N$(\mu,\sigma^{2})$ distribution where $\sigma^{2}$ is known and $\mu$ has the conjugate prior. Find the $p=0.95$ H.P.D. credible region for $\mu$. Compare this to a $95\%$ C.I. for $\mu$.
Suppose the prior distribution for $\mu$ is MATH so the prior density is given by MATH and the posterior density by MATH where the constants $C_{1},C_{2}$ and $C_{3}$ depend on MATH but not on $\mu$, and where MATH Therefore the posterior distribution of $\mu$ is MATH It follows that the 0.95 H.P.D. credible region is of the form MATH Notice that as MATH the weight $w\rightarrow 1$ and so MATH is asymptotically equivalent to the sample mean $\overline{X}.$ Similarly, as MATH, MATH is asymptotically equivalent to $\sigma^{2}/n$. This means that for large values of $n,$ the H.P.D. region is close to the region MATH and the latter is the 95% confidence interval for $\mu$ based on the normal distribution of the maximum likelihood estimator $\overline{X}.$
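The convergence of the H.P.D. region to the frequentist confidence interval is easy to see numerically. The sketch below (ours; it assumes a N$(\mu_{0},\sigma_{0}^{2})$ prior, so the posterior is normal with mean $w\bar{x}+(1-w)\mu_{0}$, where $w=(n/\sigma^{2})/(n/\sigma^{2}+1/\sigma_{0}^{2})$, and variance $1/(n/\sigma^{2}+1/\sigma_{0}^{2})$) prints both intervals as $n$ grows:

import numpy as np

Z = 1.959964   # the 0.975 standard normal quantile

def hpd_and_ci(x, sigma2, mu0, s02):
    # Because the posterior is normal, the 0.95 H.P.D. region is simply
    # posterior mean +/- Z * posterior standard deviation.
    n, xbar = len(x), np.mean(x)
    precision = n / sigma2 + 1.0 / s02
    w = (n / sigma2) / precision
    centre = w * xbar + (1 - w) * mu0
    hpd = (centre - Z / np.sqrt(precision), centre + Z / np.sqrt(precision))
    ci = (xbar - Z * np.sqrt(sigma2 / n), xbar + Z * np.sqrt(sigma2 / n))
    return hpd, ci

rng = np.random.default_rng(2)
for n in (10, 100, 10000):   # the two intervals agree for large n
    x = rng.normal(3.0, 2.0, size=n)
    print(n, hpd_and_ci(x, sigma2=4.0, mu0=0.0, s02=1.0))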

Finally, although statisticians argue over whether the Bayesian or the frequentist approach is better, there is really no one right way to do statistics. There is something fundamentalist about the Bayesian paradigm (though the Reverend Bayes was, as far as we know, far from a fundamentalist) in that it places all objects, parameters and data, in much the same context and treats them similarly. It is a coherent philosophy of statistics, and a Bayesian will vigorously argue that there is an inconsistency in regarding some unknowns as random and others as deterministic. There are certainly instances in which a Bayesian approach seems more sensible, particularly, for example, if the parameter is a measurement on a possibly randomly chosen individual (say the expected total annual claim of a client of an insurance company).

Hypothesis Tests

Statistical estimation usually concerns the estimation of the value of a parameter when we know little about it except perhaps that it lies in a given parameter space, and when we have no a priori reason to prefer one value of the parameter over another. If, however, we are asked to decide between two possible values of the parameter, the consequences of one choice of the parameter value may be quite different from another choice. For example, if we believe $Y_{i}$ is normally distributed with mean $\alpha+\beta x_{i}$ and variance $\sigma^{2}$ for some explanatory variables $x_{i}$, then the value $\beta=0$ means there is no relation between $Y_{i}$ and $x_{i}$. We need neither collect the values of $x_{i}$ nor build a model around them. Thus the two choices $\beta=0$ and $\beta=1$ are quite different in their consequences. This is often the case.

A hypothesis test involves a (usually natural) separation of the parameter space $\Omega$ into two disjoint regions, $\Omega_{0}$ and $\Omega-\Omega_{0}$. By the difference between the two sets we mean those points in the former $(\Omega)$ that are not in the latter $(\Omega_{0})$. This partition of the parameter space corresponds to testing the null hypothesis that the parameter is in MATH. We usually write this hypothesis in the form MATH The null hypothesis is usually the status quo. For example, in a test of a new drug, the null hypothesis would be that the drug has no effect, or no more of an effect than drugs already on the market. The null hypothesis is rejected only if there is reasonably strong evidence against it. The alternative hypothesis determines what departures from the null hypothesis are anticipated. In this case, it might be simply MATH Since we do not know the true value of the parameter, we must base our decision on the observed value of $X$. The hypothesis test is conducted by determining a partition of the sample space into two sets, the critical or rejection region $R$ and its complement $\bar{R}$, which is called the acceptance region. We declare that $H_{0}$ is false (in favour of the alternative) if we observe $x\in R$.

Definition

The power function of a test with critical region $R$ is the function MATH or the probability that the null hypothesis is rejected as a function of the parameter.

It is obviously desirable, in order to minimize the two types of possible errors in our decision, for the power function $\beta(\theta)$ to be small for MATH but large otherwise. Rejecting the null hypothesis when it is true (a type I error) is a particularly important kind of error, and we attempt to control its probability. This probability determines one important measure of the performance of a test, the level of significance.

Definition

A test has level of significance $\alpha$ if MATH for all MATH.

The level of significance is simply an upper bound on the probability of a type I error. There is no assurance that the upper bound is tight, that is, that equality is achieved somewhere. The lowest such upper bound is often called the size of the test.

Definition

The size of a test is equal to MATH.

Uniformly Most Powerful Tests

Tests are often constructed by specifying the size of the test, which in turn controls the probability of a type I error, and then attempting to minimize the probability that the null hypothesis is accepted when it is false (a type II error). Equivalently, we try to maximize the power function of the test for MATH.

Definition

A test with power function $\beta(\theta)$ is a uniformly most powerful (U.M.P.) test of size $\alpha$ if, for all other tests of the same size $\alpha$ having power function MATH, we have MATH for all MATH.

The word ``uniformly'' above refers to the fact that one function dominates another, that is, MATH uniformly for all MATH. When the alternative MATH consists of a single point $\{\theta_{1}\}$, the construction of a best test is particularly easy. In this case, we may drop the word ``uniformly'' and refer to a ``most powerful test''. The construction of a best test, by this definition, is possible only under rather special circumstances. First, we often require a simple null hypothesis. This is the case when MATH consists of a single point $\{\theta_{0}\}$ and so we are testing the null hypothesis MATH.

Neyman-Pearson Lemma B14

Let $X$ have probability (density) function $f_{\theta}(x),$ $\theta\in\Omega $. Consider testing a simple null hypothesis MATH against a simple alternative MATH. For a constant $c$, suppose the critical region defined by MATH corresponds to a test of size $\alpha$. Then the test with this critical region is a most powerful test of size $\alpha$ for testing MATH against MATH.


Proof:

Consider another critical region $R_{1}$ with the same size. Then MATH Therefore MATH and MATH

For MATH, MATH and thus MATH. For MATH, MATH, and thus

MATH

Now MATH and MATH Therefore, using (4.1), (4.2), and (4.3) we have MATH and the test with critical region $R$ is therefore the most powerful.

Example

Suppose we anticipate collecting daily returns from the past $n$ days of a stock, MATH assumed to be distributed according to a NormalMATH distribution. Here $\Delta$ is the length of a day measured in years, $\Delta\simeq1/252$, and $\mu,\sigma^{2}$ are the annual drift and volatility parameters. We wish to test whether the stock has zero or positive drift, so we wish to test the hypothesis $H_{0}:\mu=0$ against the alternative $H_{1}:\mu>0$ at level of significance $\alpha$. We want the probability of an incorrect decision when the drift is 20% per year to be small, so let us choose it to be $\alpha$ as well, which means that when $\mu=0.2,$ the power of the test should be at least $1-\alpha.$ How large a sample must be taken in order to ensure this?

The test itself is easy to express. We reject the null hypothesis if MATH where the value $z_{\alpha}$ has been chosen so that MATH when $Z$ has a standard normal distribution. The power of the test is the probability MATH when the parameter $\mu_{1}=0.2,$ and this is MATH where $Z$ has a standard normal distribution. Since we want the power to be $1-\alpha,$ the value MATH must be chosen to be $-z_{\alpha}.$ Solving for the value of $n,$ MATH Now if we try some reasonable values for the parameters, for example $\sigma^{2}=0.2,$ $\Delta=1/252,$ $\mu_{1}=0.2,$ $\alpha=0.05$, then $n\simeq14,000,$ which is about 55 years' worth of data, a far larger sample than we could hope to collect. This example shows that the typical variabilities in the market are so large, compared with even fairly high rates of return, that it is almost impossible to distinguish between theoretical rates of return of 0% and 20% per annum using a hypothesis test with daily data.
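The sample-size calculation is easy to reproduce (a sketch of the arithmetic above; the closed form $n=(2z_{\alpha}\sigma/\mu_{1})^{2}/\Delta$ follows from setting $z_{\alpha}-\mu_{1}\sqrt{n\Delta}/\sigma=-z_{\alpha}$):

from scipy import stats

def required_sample_size(mu1, sigma2, Delta, alpha=0.05):
    # Smallest n for which the size-alpha test of H0: mu = 0 has power
    # at least 1 - alpha at mu = mu1 (normal returns, known variance).
    z = stats.norm.ppf(1 - alpha)
    return (2 * z * sigma2 ** 0.5 / mu1) ** 2 / Delta

n = required_sample_size(mu1=0.2, sigma2=0.2, Delta=1 / 252)
print(round(n), round(n / 252))   # roughly 13,600 observations, about 54 years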

Relationship Between Hypothesis Tests and C.I.'s

There is a close relationship between hypothesis tests and confidence intervals as the following example illustrates. Suppose MATH is a random sample from the N($\theta$,1) distribution and we wish to test the hypothesis MATH against MATH. The critical region MATH is a size $\alpha=0.05$ critical region which has a corresponding acceptance region MATH Note that the hypothesis MATH would not be rejected at the $0.05$ level if MATH or equivalently MATH which is a $95\%$ C.I. for $\theta.$
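A small numerical check of this duality (our code; $1.96$ is the two-sided $\alpha=0.05$ critical value): every $\theta_{0}$ inside the confidence interval is accepted by the corresponding test, and every $\theta_{0}$ outside is rejected.

import numpy as np

Z = 1.959964

def accepts(x, theta0):
    # Acceptance region of the two-sided size-0.05 test of H0: theta = theta0.
    return abs(np.mean(x) - theta0) <= Z / np.sqrt(len(x))

def conf_int(x):
    # The 95% C.I. obtained by inverting the acceptance region.
    half = Z / np.sqrt(len(x))
    return np.mean(x) - half, np.mean(x) + half

x = np.random.default_rng(3).normal(0.4, 1.0, size=50)
lo, hi = conf_int(x)
print((lo, hi), accepts(x, (lo + hi) / 2), accepts(x, hi + 0.01))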

Problem

Let MATH be a random sample from the Gamma$(2,\theta)$ distribution. Show that MATH is a size $\alpha=0.05$ critical region for testing MATH. Show how this critical region may be used to construct a $95\%$ C.I. for $\theta.$

Likelihood Ratio Tests

Consider a test of the hypothesis MATH against MATH. We have seen that for prescribed MATH MATH, the most powerful test of the simple null hypothesis MATH against a simple alternative MATH is based on the likelihood ratio MATH. By the Neyman-Pearson Lemma it has critical region MATH where $c$ is a constant determined by the size of the test. When either the null or the alternative hypothesis is composite (i.e. contains more than one point) and there is no uniformly most powerful test, it seems reasonable to use a test with critical region $R$ for some choice of MATH. The likelihood ratio test does this with $\theta_{1}$ replaced by $\hat{\theta}$, the maximum likelihood estimator over all possible values of the parameter, and $\theta_{0}$ replaced by the maximum likelihood estimator of the parameter when it is restricted to $\Omega_{0}$. Thus, the likelihood ratio test has critical region MATH where

MATH and $c$ is determined by the size of the test. In general, the distribution of the test statistic $\Lambda(X)$ may be difficult to find. Fortunately, however, the asymptotic distribution is known under fairly general conditions. In a few cases, we can show that the likelihood ratio test is equivalent to the use of a statistic with known distribution. In many other cases, we need to rely on the asymptotic chi-squared distribution of Theorem B6 below.

Example

Let MATH be a random sample from the N$(\mu,\sigma^{2})$ distribution where $\mu$ and $\sigma^{2}$ are unknown. Consider a test of MATH against the alternative MATH We can show that the likelihood ratio test of $H_{0}$ against $H_{1}$ has critical region MATH. Under $H_{0}$, the statistic MATH has an F$(1,n-1)$ distribution, and we can thus find a size $\alpha=0.05$ test for $n=20$.
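Assuming the statistic in question takes its usual form $n(\bar{X}-\mu_{0})^{2}/S^{2}$ (the square of the one-sample $t$ statistic, which is F$(1,n-1)$ under $H_{0}$), the test is straightforward to carry out:

import numpy as np
from scipy import stats

def f_test_mean(x, mu0, alpha=0.05):
    # Likelihood-ratio-equivalent test of H0: mu = mu0 with unknown variance.
    n = len(x)
    f = n * (np.mean(x) - mu0) ** 2 / np.var(x, ddof=1)
    return f, f > stats.f.ppf(1 - alpha, 1, n - 1)

x = np.random.default_rng(4).normal(0.3, 1.0, size=20)
print(f_test_mean(x, mu0=0.0))   # (observed F statistic, reject H0?)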

Theorem B6

Suppose MATH is a random sample from a regular statistical model MATH with $\Omega$ an open set in $k$-dimensional Euclidean space. Consider a subset of $\Omega$ of the form MATH where the index ranges over an open subset of $q$-dimensional Euclidean space. Then the likelihood ratio statistic defined by MATH is such that, under the hypothesis MATH, MATH Note: The number of degrees of freedom is the difference between the number of parameters that need to be estimated in the general model and the number left to be estimated under the restrictions imposed by $H_{0}$, that is, $k-q$.
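The chi-squared approximation can be verified by simulation in the simplest case, $H_{0}:\mu=0$ in the N$(\mu,1)$ model, where $-2\log\Lambda=n\bar{X}^{2}$ and $k-q=1$ (a sketch under these assumptions):

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 50, 100000
xbar = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)   # data under H0
stat = n * xbar ** 2                                       # -2 log Lambda
print(np.mean(stat > stats.chi2.ppf(0.95, df=1)))          # close to 0.05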

Significance Tests and p-values

We have seen that a test of hypothesis is a rule which allows us to decide whether to accept the null hypothesis $H_{0}$ or to reject it in favour of the alternative hypothesis $H_{1}$ based on the observed data. In situations in which $H_{1}$ is difficult to specify, a test of significance can be used. A (pure) test of significance is a procedure for measuring the strength of the evidence provided by the observed data against $H_{0}$. This method usually involves looking at the distribution of a test statistic or discrepancy measure $T$ under $H_{0}.$ The p-value or significance level for the test is the probability, computed under $H_{0},$ of observing a $T$ value at least as extreme as the value observed. The smaller the observed p-value, the stronger the evidence against $H_{0}$. The difficulty with this approach is finding a statistic with `good properties'. The likelihood ratio statistic provides a general test statistic which may be used.
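For instance (our sketch), the p-value attached to an observed likelihood ratio statistic can be computed from the asymptotic chi-squared distribution of Theorem B6:

from scipy import stats

def lrt_p_value(minus_2_log_lambda, df):
    # Probability, under H0, of a statistic at least as extreme as observed.
    return stats.chi2.sf(minus_2_log_lambda, df)

print(lrt_p_value(5.3, df=1))   # about 0.021, fairly strong evidence against H0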

Score and Wald Tests

Score Test

Score tests can be viewed as a more general class of tests of MATH against MATH which tend to have considerable power provided that the values of the parameter under the null and the alternative are close. If the usual regularity conditions hold then under MATH we have MATH and thus MATH For a vector MATH we have MATH The test based on $R(\theta_{0};X)$ is called a (Rao) score test. It has critical region MATH where $c$ is determined by the size of the test, that is, $c$ satisfies $P(W>c)=\alpha$ where MATH The test based on $R(\theta _{0};X)$ is asymptotically equivalent to the likelihood ratio test.
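As a worked instance (ours, with the formulas following from the Poisson model): for a Poisson$(\theta)$ sample the score is $S(\theta_{0})=\sum x_{i}/\theta_{0}-n$ and the Fisher information per observation is $J(\theta_{0})=1/\theta_{0}$, so the score statistic $S^{2}(\theta_{0})/[nJ(\theta_{0})]=(\sum x_{i}-n\theta_{0})^{2}/(n\theta_{0})$ is asymptotically $\chi^{2}(1)$ under the null:

import numpy as np
from scipy import stats

def score_test(x, theta0, alpha=0.05):
    # Rao score test of H0: theta = theta0 for a Poisson(theta) sample;
    # the information is evaluated at theta0, so no MLE is needed.
    n, s = len(x), np.sum(x)
    w = (s - n * theta0) ** 2 / (n * theta0)
    return w, w > stats.chi2.ppf(1 - alpha, df=1)

x = np.random.default_rng(6).poisson(2.0, size=40)
print(score_test(x, theta0=2.0))   # under H0, rejects about 5% of the time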

Wald Test

Suppose that $\hat{\theta}$ is the maximum likelihood estimator of $\theta$ over all $\theta\in\Omega$ and we wish to test MATH against MATH If the usual regularity conditions hold then under MATH MATH A test based on the test statistic $W(\theta_{0};X)$ is called a Wald test. It has critical region MATH where $c$ is determined by the size of the test. Both the score test and the Wald test are asymptotically equivalent to the likelihood ratio test, and the intuitive explanation for these equivalences is quite simple. For large values of the sample size $n,$ the maximum likelihood estimator MATH is close to the true value of the parameter $\theta_{0}$, and so the log likelihood can be approximated by the first two terms in the Taylor series expansion of MATH about MATH and so MATH since MATH and the observed information MATH is asymptotically equivalent to the Fisher information $J(\theta_{0}).$ This verifies the equivalence of the likelihood ratio and the Wald tests.
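Continuing the Poisson illustration (again ours): the Wald statistic evaluates the information at the maximum likelihood estimator $\hat{\theta}=\bar{X}$, giving $W=n(\bar{X}-\theta_{0})^{2}/\bar{X}$, while the likelihood ratio statistic is $-2\log\Lambda=2n[\bar{X}\log(\bar{X}/\theta_{0})-(\bar{X}-\theta_{0})]$. For large $n$ the two (and the score statistic above) nearly coincide, as the Taylor expansion argument predicts:

import numpy as np

def wald_and_lrt(x, theta0):
    # Wald and likelihood ratio statistics for H0: theta = theta0,
    # Poisson(theta) sample with MLE xbar; both are asymptotically chi^2(1).
    n, xbar = len(x), np.mean(x)
    wald = n * (xbar - theta0) ** 2 / xbar
    lrt = 2 * n * (xbar * np.log(xbar / theta0) - (xbar - theta0))
    return wald, lrt

x = np.random.default_rng(7).poisson(2.0, size=400)
print(wald_and_lrt(x, theta0=2.0))   # the two values are nearly equal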