Appendix B: Statistics
The material in this appendix can be supplemented with the classic text in statistics: Theory of Point Estimation by E. L. Lehmann (Wiley, New York, 1983).
In statistics, we often represent our data, in many cases a sample of size $n$ from some population, as a random vector $\mathbf{X}=(X_1,\ldots,X_n)$. The model can be written in the form $\{f(x;\theta);\ \theta\in\Theta\}$, where $\Theta$ is the parameter space, or set of permissible values of the parameter, and $f(x;\theta)$ is the probability density function. A statistic $T(\mathbf{X})$ is a function of the data which does not depend on the unknown parameter $\theta$. Although a statistic $T(\mathbf{X})$ is not a function of $\theta$, its distribution can depend on $\theta$. An estimator is a statistic considered for the purpose of estimating a given parameter. One of our objectives is to find a ``good'' estimator of the parameter $\theta$, in some sense of the word ``good''. How do we ensure that a statistic is estimating the correct parameter, is not consistently too large or too small, and has had as much variability as possible removed? The problem of estimating the correct parameter is often dealt with by requiring that the estimator be unbiased.
We will denote an expected value under the assumed parameter value $\theta$ by $E_\theta$. Thus, in the continuous case $E_\theta[T(\mathbf{X})]=\int T(x)f(x;\theta)\,dx$ and in the discrete case $E_\theta[T(\mathbf{X})]=\sum_x T(x)f(x;\theta)$, provided the integral/sum converges absolutely. In the discrete case, $f(x;\theta)$ is the probability function of $\mathbf{X}$ under this parameter value.
A statistic $T(\mathbf{X})$ is an unbiased estimator of $\theta$ if $E_\theta[T(\mathbf{X})]=\theta$ for all $\theta\in\Theta$.
For example, suppose that $X_1,\ldots,X_n$ are independent, each with the Poisson distribution with parameter $\lambda$. Notice that the statistic $\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i$ is such that $E_\lambda[\bar{X}]=\lambda$, and so $\bar{X}$ is an unbiased estimator of $\lambda$. This means that it is centered in the correct place, but does not mean that it is a best estimator in any sense.
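The unbiasedness of the sample mean for a Poisson parameter can be checked empirically; a minimal sketch, where $\lambda=3$, the sample size and the number of replications are illustrative choices:

```python
import numpy as np

# Sketch: empirical check that the sample mean is unbiased for the
# Poisson parameter (lambda = 3.0, n = 10, 200_000 replications are
# illustrative choices, not values from the text).
rng = np.random.default_rng(0)
lam, n, replications = 3.0, 10, 200_000

# Each row is one sample of size n; each row mean is one draw of the estimator.
samples = rng.poisson(lam, size=(replications, n))
estimates = samples.mean(axis=1)

print(estimates.mean())  # close to lam = 3.0, consistent with unbiasedness
```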
In decision theory, in order to determine whether a given estimator or statistic $T$ does well for estimating $\theta$, we consider a loss function or distance function between the estimator and the true value; call this $L(T,\theta)$. This is then averaged over all possible values of the data to obtain the risk: $R(\theta)=E_\theta[L(T(\mathbf{X}),\theta)]$. A good estimator is one with little risk; a bad estimator is one whose risk is high. One particular risk function is called the mean squared error (M.S.E.) and corresponds to $L(T,\theta)=(T-\theta)^2$. The mean squared error has a useful decomposition into two components, the variance of the estimator and the square of its bias: $E_\theta[(T-\theta)^2]=\mathrm{var}_\theta(T)+\left(E_\theta[T]-\theta\right)^2.$
For example, if $X$ has a Normal$(\theta,1)$ distribution, the mean squared error of $X$ is $1$ for all $\theta$ because the bias is zero. On the other hand, the estimator $X/2$ has bias $-\theta/2$ and variance $1/4$, so the mean squared error is $\frac{1}{4}+\frac{\theta^2}{4}$. Obviously $X/2$ has smaller mean squared error provided that $\theta$ is around $0$ (more precisely, provided $|\theta|<\sqrt{3}$), but for $|\theta|$ large, $X$ is preferable. Of these two estimators, only $X$ is unbiased.
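The bias–variance tradeoff described above can be illustrated numerically. The sketch below compares the unbiased estimator $X$ with a hypothetical shrinkage estimator $X/2$ for a single observation $X\sim N(\theta,1)$; the parameter values are illustrative:

```python
import numpy as np

# Sketch: Monte Carlo comparison of mean squared error for a single
# observation X ~ N(theta, 1).  The shrinkage estimator X/2 is a
# hypothetical illustration: its MSE is 1/4 + theta^2/4, which beats
# the unbiased estimator X (MSE = 1) only when theta is near 0.
rng = np.random.default_rng(1)
reps = 400_000

for theta in (0.0, 1.0, 3.0):
    x = rng.normal(theta, 1.0, size=reps)
    mse_unbiased = np.mean((x - theta) ** 2)    # ~ 1 for every theta
    mse_shrunk = np.mean((x / 2 - theta) ** 2)  # ~ 1/4 + theta^2/4
    print(theta, round(mse_unbiased, 3), round(mse_shrunk, 3))
```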
In fact, there is usually no single estimator that outperforms all other estimators at all values of the parameter if we use mean squared error as our basis for comparison. In order to achieve an optimal estimator, it is unfortunately necessary to restrict ourselves to a specific class of estimators and select the best within the class. Of course, the best within this class will only be as good as the class itself (best in a class of one is not much of a recommendation), and therefore we must ensure that restricting ourselves to this class is not unduly restrictive. The class of all estimators is usually too large to obtain a meaningful solution. One common restriction is to the class of all unbiased estimators.
An estimator $T$ is said to be a uniformly minimum variance unbiased estimator (U.M.V.U.E.) of the parameter $\theta$ if
(i) it is an unbiased estimator of $\theta$, and
(ii) among all unbiased estimators of $\theta$ it has the smallest mean squared error, and therefore the smallest variance.
A sufficient statistic is one that, from a certain perspective, contains all the necessary information for making inferences (e.g. estimating the parameter with a point estimator or a confidence interval, or conducting a test of a hypothesized value) about the unknown parameters in a given model. It is important to remember that a statistic is sufficient for inference on a specific parameter; it does not necessarily contain all relevant information in the data for other inferences. For example, if you wished to test whether the family of distributions is an adequate fit to the data (a goodness-of-fit test), the sufficient statistic for the parameter in the model does not contain the relevant information.
Suppose the data is in a vector $\mathbf{X}$ and $T(\mathbf{X})$ is a sufficient statistic for $\theta$. The intuitive basis for sufficiency is that if the conditional distribution of $\mathbf{X}$ given $T(\mathbf{X})$ does not depend on $\theta$, then $\mathbf{X}$ provides no additional value beyond $T(\mathbf{X})$ for estimating $\theta$. The assumption is that random variables carry information on a statistical parameter only insofar as their distributions (or conditional distributions) change with the value of the parameter, and that since, given $T(\mathbf{X})$, we can randomly generate values for $\mathbf{X}$ without knowledge of the parameter and with the correct distribution, these randomly generated values cannot carry additional information. All of this, of course, assumes that the model is correct and that $\theta$ is the only unknown. The distribution of the data given a sufficient statistic will often have value for other purposes, such as measuring the variability of the estimator or testing the validity of the model.
A statistic $T(\mathbf{X})$ is sufficient for a statistical model if the distribution of the data $\mathbf{X}$ given $T(\mathbf{X})$ does not depend on the unknown parameter $\theta$.
The use of a sufficient statistic is formalized in the Sufficiency Principle, which states that if $T$ is a sufficient statistic for a model, and $\mathbf{x}_1$ and $\mathbf{x}_2$ are two different possible observations that have identical values of the sufficient statistic, $T(\mathbf{x}_1)=T(\mathbf{x}_2)$, then whatever inference we would draw from observing $\mathbf{x}_1$ we should draw exactly the same inference from $\mathbf{x}_2$.
Sufficient statistics are not unique. For example, if the sample mean $\bar{X}$ is a sufficient statistic, then any other statistic that allows us to obtain $\bar{X}$ is also sufficient. This includes all one-to-one functions of $\bar{X}$ (these are essentially equivalent), and all statistics $S$ for which we can write $\bar{X}=g(S)$ for some, possibly many-to-one, function $g$. One result which is normally used to verify whether a given statistic is sufficient is the Factorization Criterion for Sufficiency: suppose $\mathbf{X}$ has probability density function $f(x;\theta)$ and $T(\mathbf{X})$ is a statistic. Then $T(\mathbf{X})$ is a sufficient statistic for $\theta$ if and only if there exist two non-negative functions $g$ and $h$ so that we can factor the probability density function $f(x;\theta)=g(T(x);\theta)\,h(x)$ for all $\theta$. This factorization into two pieces, one which involves both the statistic and the unknown parameter, and the other which may be a constant or depend on $x$ but does not depend on the unknown parameter, need only hold on a set of possible values of $x$ which carries the full probability. That is, for some set $A$ with $P_\theta(\mathbf{X}\in A)=1$ for all $\theta$, we require $f(x;\theta)=g(T(x);\theta)\,h(x)$ for all $x\in A$.
A statistic $T$ is a minimal sufficient statistic for $\theta$ if it is sufficient and if, for any other sufficient statistic $S$, there exists a function $g$ such that $T=g(S)$.
This definition says in effect that a minimal sufficient statistic can be recovered from any other sufficient statistic. A statistic $T$ implicitly partitions the sample space into events of the form $\{x;\ T(x)=t\}$ for varying $t$, and if $T$ is minimal sufficient, it induces the coarsest possible partition (i.e. the largest possible sets) of the sample space among all sufficient statistics. This partition is called the minimal sufficient partition.
The property of completeness is one which is useful for determining the uniqueness of estimators and verifying in some cases that a minimal sufficient reduction has been found. It bears no relation to the notion of a complete market in finance, or the mathematical notion of a complete metric space. Let $\mathbf{X}$ denote the observations from a distribution with probability density function $f(x;\theta)$. Suppose $T=T(\mathbf{X})$ is a statistic and $g_1(T)$, a function of $T$, is an unbiased estimator of $\theta$, so that $E_\theta[g_1(T)]=\theta$. Under what circumstances is this the only unbiased estimator which is a function of $T$? To answer this question, suppose $g_1(T)$ and $g_2(T)$ are both unbiased estimators of $\theta$, and consider the difference $h(T)=g_1(T)-g_2(T)$. Since $g_1(T)$ and $g_2(T)$ are both unbiased estimators of the parameter, we have $E_\theta[h(T)]=0$ for all $\theta$. Now if the only function $h$ which satisfies $E_\theta[h(T)]=0$ for all $\theta$ is the zero function $h\equiv 0$, then the two unbiased estimators must be identical. A statistic with this property is said to be complete. Technically it is not the statistic that is complete, but the family of distributions of $T$ in the model.
The statistic $T$ is complete if, for any function $h$, $E_\theta[h(T)]=0$ for all $\theta$ implies $P_\theta(h(T)=0)=1$ for all $\theta$.
For example, let $X_1,\ldots,X_n$ be a random sample from the Normal$(\theta,1)$ distribution. Consider $T=(X_1,\sum_{i=1}^n X_i)$. Then $T$ is sufficient for $\theta$ but is not complete. It is easy to see that it is not complete, because the function $h(T)=X_1-\frac{1}{n}\sum_{i=1}^n X_i$ is a function of $T$ which has zero expectation for all values of $\theta$, and yet the function is not identically zero. The fact that the statistic is sufficient but not complete is a hint that further reduction is possible, that it is not minimal sufficient. In fact in this case, as we will show a little later, taking only the second component of $T$, namely $\sum_{i=1}^n X_i$, provides a minimal sufficient, complete statistic.
If $T$ is a complete and sufficient statistic for the model $\{f(x;\theta);\ \theta\in\Theta\}$, then $T$ is a minimal sufficient statistic for the model.
The converse to the above theorem is not true. Let $X_1,\ldots,X_n$ be a random sample from the continuous uniform distribution on the interval $[\theta,\theta+1]$. This distribution has probability density function $f(x;\theta)=I(\theta\le x\le\theta+1)$. Then, using the factorization criterion above, the joint probability density function for a sample of $n$ independent observations from this density is $f(\mathbf{x};\theta)=I(\theta\le x_{(1)})\,I(x_{(n)}\le\theta+1)$, where $I(\cdot)$ is one or zero as the inequality holds or does not hold, and $x_{(1)}$ and $x_{(n)}$ are the smallest and the largest values in the sample. Obviously $f(\mathbf{x};\theta)$ can be written as a function $g(T(\mathbf{x});\theta)$ where $T(\mathbf{X})=(X_{(1)},X_{(n)})$, and so $T$ is sufficient. Moreover it is not difficult to show that no further reduction (for example to $X_{(n)}$ alone) is possible, or we can no longer provide such a factorization, so $T$ is minimal sufficient. Nevertheless, if the function $h$ is defined by $h(T)=X_{(n)}-X_{(1)}-\frac{n-1}{n+1}$ (clearly a non-zero function), then $E_\theta[h(T)]=0$ for all $\theta$, and therefore $T$ is not a complete statistic.
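The mean-zero function at the heart of this counterexample can be checked by simulation. A minimal sketch, assuming the classic Uniform$(\theta,\theta+1)$ example, where the expected sample range is $(n-1)/(n+1)$ for every $\theta$:

```python
import numpy as np

# Sketch: for a sample of size n from Uniform(theta, theta + 1), the
# sample range X_(n) - X_(1) has expectation (n-1)/(n+1) regardless of
# theta, so h(T) = X_(n) - X_(1) - (n-1)/(n+1) has mean zero without
# being identically zero: (X_(1), X_(n)) is not complete.
rng = np.random.default_rng(2)
n, reps = 5, 500_000

for theta in (0.0, 10.0):
    x = rng.uniform(theta, theta + 1.0, size=(reps, n))
    sample_range = x.max(axis=1) - x.min(axis=1)
    print(theta, sample_range.mean())  # both close to (n-1)/(n+1) = 2/3
```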
For any random variables $X$ and $Y$,
$E[X]=E[E[X\mid Y]]$
and
$\mathrm{var}(X)=E[\mathrm{var}(X\mid Y)]+\mathrm{var}(E[X\mid Y]).$
In much of what follows, we wish to be able to estimate a general function of the unknown parameter, like $g(\theta)$, instead of the parameter $\theta$ itself. We have already seen that if $T$ is a complete statistic, then there is at most one function of $T$ that provides an unbiased estimator of a given $g(\theta)$. In fact, if we can find such a function, then it automatically has minimum variance among all possible unbiased estimators of $g(\theta)$ that are based on the same data.
If $T$ is a complete sufficient statistic for the model and $E_\theta[h(T)]=g(\theta)$ for all $\theta$, then $h(T)$ is the U.M.V.U.E. of $g(\theta)$.
When we have a complete sufficient statistic, and we are able to find an unbiased estimator, even a bad one, of $g(\theta)$, then there is a simple recipe for determining the U.M.V.U.E. of $g(\theta)$.
If $T$ is a complete sufficient statistic for the model and $S$ is any unbiased estimator of $g(\theta)$, then $E[S\mid T]$ is the U.M.V.U.E. of $g(\theta)$.
Note that we did not subscript the conditional expectation $E[S\mid T]$ with $\theta$, because whenever $T$ is a sufficient statistic, the conditional distribution of $S$ given $T$ does not depend on the underlying value of the parameter.
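This recipe (conditioning a crude unbiased estimator on a complete sufficient statistic) can be illustrated numerically. A sketch for a standard example not taken from the text: estimating $P(X=0)=e^{-\lambda}$ from a Poisson sample, where conditioning the indicator $1\{X_1=0\}$ on $T=\sum_i X_i$ is known to give $(1-1/n)^T$:

```python
import numpy as np

# Sketch: Rao-Blackwellization for estimating P(X = 0) = exp(-lambda)
# from a Poisson(lambda) sample.  The crude unbiased estimator is the
# indicator 1{X_1 = 0}; conditioning on the complete sufficient
# statistic T = sum(X_i) gives E[1{X_1 = 0} | T] = (1 - 1/n)^T, the
# U.M.V.U.E.  (lambda = 2, n = 10 are illustrative choices.)
rng = np.random.default_rng(3)
lam, n, reps = 2.0, 10, 200_000

x = rng.poisson(lam, size=(reps, n))
crude = (x[:, 0] == 0).astype(float)   # unbiased but noisy
t = x.sum(axis=1)
rao_blackwell = (1.0 - 1.0 / n) ** t   # unbiased, smaller variance

print(np.exp(-lam))                               # target, about 0.1353
print(crude.mean(), crude.var())
print(rao_blackwell.mean(), rao_blackwell.var())  # same mean, smaller variance
```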
Suppose $\mathbf{X}$ has a (joint) probability density function of the form $f(x;\theta)=h(x)\exp\{\sum_{i=1}^k c_i(\theta)T_i(x)-d(\theta)\}$ for functions $c_i(\theta)$, $T_i(x)$, $d(\theta)$ and $h(x)$. Then we say that the density is a member of the exponential family of densities. We call $T(\mathbf{X})=(T_1(\mathbf{X}),\ldots,T_k(\mathbf{X}))$ the natural sufficient statistic.
A member of the exponential family can be re-expressed in different ways, and so the natural sufficient statistic is not unique. For example, we may multiply a given $c_i(\theta)$ by a constant and divide the corresponding $T_i(x)$ by the same constant, resulting in the same probability density function. Various other conditions need to be applied as well, for example to ensure that the $T_i$ are all essentially different functions of the data. One of the important properties of the exponential family is its closure under repeated independent sampling: if $X_1,\ldots,X_n$ are independent and identically distributed with an exponential family distribution, then their joint distribution is also an exponential family distribution.
Let $X_1,\ldots,X_n$ be a random sample from the distribution with probability density function given by (exfd). Then $\mathbf{X}=(X_1,\ldots,X_n)$ also has an exponential family form, with joint probability density function $f(\mathbf{x};\theta)=\left(\prod_{j=1}^n h(x_j)\right)\exp\{\sum_{i=1}^k c_i(\theta)\sum_{j=1}^n T_i(x_j)-n\,d(\theta)\}$. In other words, $h(x)$ is replaced by $\prod_{j=1}^n h(x_j)$ and $d(\theta)$ by $n\,d(\theta)$. The natural sufficient statistic is $\left(\sum_{j=1}^n T_1(X_j),\ldots,\sum_{j=1}^n T_k(X_j)\right)$.
It is usual to reparameterize equation (exfd) by replacing $c_i(\theta)$ by a new parameter $\eta_i$. This results in a more efficient representation, the canonical form of the exponential family density: $f(x;\eta)=h(x)\exp\{\sum_{i=1}^k\eta_iT_i(x)-d(\eta)\}$. The natural parameter space in this form is the set of all values of $\eta=(\eta_1,\ldots,\eta_k)$ for which the above function is integrable; that is, for which $\int h(x)\exp\{\sum_{i=1}^k\eta_iT_i(x)\}\,dx<\infty$. We would like this parameter space to be large enough to allow intervals for each of the components of the vector $\eta$, and so we will later need to assume that the natural parameter space contains a $k$-dimensional rectangle.
If the statistic satisfies a linear constraint, for example $\sum_{i=1}^k a_iT_i(x)=c$ with probability one, then the number of terms $k$ could be reduced and a more efficient representation of the probability density function is possible. Similarly, if the parameters $\eta_i$ satisfy a linear relationship, they are not all statistically meaningful, because one of the parameters is obtainable from the others. These are all situations that we would handle by reducing the model to a more efficient and non-redundant form. So, in what follows, we will generally assume such a reduction has already been made and that the exponential family representation is minimal in the sense that neither the $T_i$ nor the $\eta_i$ satisfy any linear constraints.
We will say that $X$ has a regular exponential family distribution if it is in canonical form, is of full rank in the sense that neither the $T_i$ nor the $\eta_i$ satisfy any linear constraints permitting a reduction in the value of $k$, and the natural parameter space contains a $k$-dimensional rectangle.
By Theorem B5, if $X$ has a regular exponential family distribution, then a random sample $(X_1,\ldots,X_n)$ from this distribution also has a regular exponential family distribution.
The main advantage of identifying a distribution as a member of the regular exponential family is that it allows us to quickly identify the minimal sufficient statistic and conclude that it is complete.
If $X$ has a regular exponential family distribution, then the natural sufficient statistic $T(X)$ is a complete sufficient statistic.
Let $X_1,\ldots,X_n$ be independent observations, all from the Normal$(\mu,\sigma^2)$ distribution. Notice that with the parameter $\theta=(\mu,\sigma^2)$ we can write the probability density function of each $X_i$ as a constant times $\exp\{\frac{\mu}{\sigma^2}x-\frac{1}{2\sigma^2}x^2-\frac{\mu^2}{2\sigma^2}\}$, so the natural parameters are $\eta_1=\mu/\sigma^2$ and $\eta_2=-1/(2\sigma^2)$, and the natural sufficient statistic is $(X,X^2)$. For a sample of size $n$ from this density we have the same natural parameters and, by the above theorem, a complete sufficient statistic is $(\sum_{i=1}^nX_i,\sum_{i=1}^nX_i^2)$. If you wished to find a U.M.V.U.E. of any function of $\theta$, for example the parameter $\sigma^2$, we need only find some function of the complete sufficient statistic which has the correct expected value. For example, in this case, with $\bar{X}$ the sample mean and $s^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2$ the sample variance, it is not difficult to show that $E[s^2]=\sigma^2$, and so $s^2$ is an unbiased estimator and a function of the complete sufficient statistic, so it is the desired U.M.V.U.E. Suppose one of the parameters, say $\sigma^2$, is assumed known. Then the normal distribution is still in the regular exponential family, since it has a representation with the function $h(x)$ completely known. In this case, for a sample of size $n$ from this distribution, the statistic $\sum_{i=1}^nX_i$ is complete and sufficient for $\mu$, and so any function of it, say $\bar{X}$, which is an unbiased estimator of $\mu$ is automatically the U.M.V.U.E.
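The unbiasedness of the sample variance, which makes it the U.M.V.U.E. of the normal variance, is easy to confirm numerically; a minimal sketch with illustrative values of the mean, standard deviation and sample size:

```python
import numpy as np

# Sketch: the sample variance s^2 (divisor n - 1) is an unbiased
# function of the complete sufficient statistic (sum X_i, sum X_i^2),
# hence the U.M.V.U.E. of sigma^2.  (mu = 1, sigma = 2, n = 8 are
# illustrative choices.)
rng = np.random.default_rng(4)
mu, sigma, n, reps = 1.0, 2.0, 8, 300_000

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)   # s^2 = sum (X_i - Xbar)^2 / (n - 1)

print(s2.mean())  # close to sigma^2 = 4.0
```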
The Table below gives various members of the regular exponential family and the corresponding complete sufficient statistic.
Members of the Regular Exponential Family | Complete Sufficient Statistic |
Binomial$(n,p)$ | $\sum_i X_i$ |
Geometric$(p)$ | $\sum_i X_i$ |
Poisson$(\lambda)$ | $\sum_i X_i$ |
Normal$(\mu,\sigma^2)$ | $(\sum_i X_i,\sum_i X_i^2)$; $\sum_i X_i$ if $\sigma^2$ known |
Normal$(\mu,\sigma^2)$, $\mu$ known | $\sum_i(X_i-\mu)^2$ |
Gamma$(\alpha,\beta)$ (includes exponential) | $(\sum_i\ln X_i,\sum_i X_i)$; $\sum_i X_i$ if $\alpha$ known |
For a regular exponential family, it is possible to differentiate under the integral; that is, $\frac{\partial}{\partial\eta_i}\int h(x)\exp\{\sum_j\eta_jT_j(x)\}\,dx=\int\frac{\partial}{\partial\eta_i}\left[h(x)\exp\{\sum_j\eta_jT_j(x)\}\right]dx$ for any $i$ and any $\eta$ in the interior of the natural parameter space.
Let $\mathbf{X}$ denote observations from a distribution with probability density function $f(x;\theta)$, and let $T(\mathbf{X})$ be a statistic. The information on the parameter provided by a statistic lies in the sensitivity of its distribution to changes in the parameter. For example, suppose a modest change in the parameter value leads to a large change in the expected value of the distribution, resulting in a large shift in the data. Then the parameter can be estimated fairly precisely. On the other hand, if a statistic has no sensitivity at all in distribution to the parameter, then it would appear to contain little information for point estimation of this parameter. A statistic of the second kind is called an ancillary statistic.
$T(\mathbf{X})$ is an ancillary statistic if its distribution does not depend on the unknown parameter $\theta$.
Ancillary statistics are, in a sense, orthogonal or perpendicular to minimal sufficient statistics and are analogous to the residuals in a multiple regression, while the complete sufficient statistics are analogous to the estimators of the regression coefficients. It is well-known that the residuals are uncorrelated with the estimators of the regression coefficients (and independent in the case of normal errors). However, the ``irrelevance'' of the ancillary statistic seems to be limited to the case when it is not part of the minimal (preferably complete) sufficient statistic as the following example illustrates.
Suppose a fair coin is tossed to determine a sample size $N$, with $N=n_1$ with probability $1/2$ and $N=n_2$ otherwise. We then observe a Binomial random variable $X$ with parameters $(N,p)$. Then the minimal sufficient statistic is $(X,N)$, but $N$ is an ancillary statistic, since its distribution does not depend on the unknown parameter $p$. Is $N$ completely irrelevant to inference about $p$? If you reported to your boss an estimator of $p$ such as $X/N$ without telling him or her the value of $N$, how long would you expect to keep your job? Clearly any sensible inference about $p$ should include information about the precision of the estimator, and this inevitably requires knowing the value of $N$. Although the distribution of $N$ does not depend on the unknown parameter, so that $N$ is ancillary, it carries important information about precision. The following theorem allows us to use the properties of completeness and ancillarity to prove the independence of two statistics without finding their joint distribution.
Consider $\mathbf{X}$ with probability density function $f(x;\theta)$. Let $T(\mathbf{X})$ be a complete sufficient statistic. Then $T(\mathbf{X})$ is independent of every ancillary statistic.
Assume $S_t$ represents the market price of a given asset, such as a portfolio of stocks, at time $t$, and $S_0$ is the value of the portfolio at the beginning of a given time period (assume that the analysis is conditional on $S_0$, so that $S_0$ is fixed and known). The process $S_t$ is assumed to be a Brownian motion, and so the distribution of $S_t-S_0$ for any fixed time $t$ is Normal$(\mu t,\sigma^2t)$ for $0<t<\infty$. Suppose that for a period of length 1, we record both the period high $H=\max_{0\le t\le1}S_t$ and the close $S_1$. Define random variables $M=H-S_0$ and $B=S_1-S_0$. Then the joint probability density function of $(M,B)$ can be shown to be $f(m,b;\mu,\sigma)=\frac{2(2m-b)}{\sigma^3\sqrt{2\pi}}\exp\left\{-\frac{(2m-b)^2}{2\sigma^2}\right\}\exp\left\{\frac{\mu b}{\sigma^2}-\frac{\mu^2}{2\sigma^2}\right\},\quad m\ge\max(0,b).$
It is not hard to show that this is a member of the regular exponential family of distributions with both parameters $(\mu,\sigma^2)$ assumed unknown. If one parameter is known, for example $\sigma$, it is again a regular exponential family distribution, with natural sufficient statistic $B$. Consequently, if we record independent pairs of observations $(M_i,B_i)$ on the portfolio for a total of $n$ distinct time periods (and if we assume no change in the parameters), then the statistic $\bar{B}=\frac{1}{n}\sum_{i=1}^nB_i$ is a complete sufficient statistic for the drift parameter $\mu$. Since it is also an unbiased estimator of $\mu$, it is the U.M.V.U.E. of $\mu$. By Basu's theorem it will be independent of any ancillary statistic, i.e. any statistic whose distribution does not depend on the parameter $\mu$. One such statistic is the sample variance $\frac{1}{n-1}\sum_{i=1}^n(B_i-\bar{B})^2$, which is therefore independent of $\bar{B}$.
Suppose we have observed $n$ independent discrete random variables $X_1,\ldots,X_n$, all with probability function $f(x;\theta)$, where the scalar parameter $\theta$ is unknown. Suppose our observations are $x_1,\ldots,x_n$. Then the probability of the observed data is $\prod_{i=1}^nf(x_i;\theta)$. When the observations have been substituted, this becomes a function of the parameter only, referred to as the likelihood function and denoted $L(\theta)$. Its natural logarithm is usually denoted $l(\theta)=\ln L(\theta)$. Now, in the absence of any other information, it seems logical that we should estimate the parameter using a value most compatible with the data. For example, we might choose the value maximizing the likelihood function $L(\theta)$, or equivalently maximizing $l(\theta)$. We call such a maximizer the maximum likelihood (M.L.) estimate, provided it exists and satisfies any restrictions placed on the parameter. We denote it by $\hat\theta$. Obviously, it is a function of the data, that is, $\hat\theta=\hat\theta(x_1,\ldots,x_n)$; the corresponding estimator is $\hat\theta(X_1,\ldots,X_n)$. In practice we are usually satisfied with a local maximum of the likelihood function, provided that it is reasonable, partly because the global maximization problem is often quite difficult, and partly because the global maximum is not always better than a local maximum near a preliminary estimator that is known to be consistent. In the case of a twice differentiable log likelihood function on an open interval, this local maximum is usually found by solving the equation $S(\theta)=0$ for a solution $\hat\theta$, where $S(\theta)=\frac{\partial}{\partial\theta}l(\theta)$ is called the score function. The equation $S(\theta)=0$ is called the (maximum) likelihood equation or score equation. To verify a local maximum we compute the second derivative and show that it is negative at $\hat\theta$, or alternatively show $I(\hat\theta)>0$, where $I(\theta)=-\frac{\partial^2}{\partial\theta^2}l(\theta)$ is called the information function. In a sense to be investigated later, $I(\hat\theta)$, the observed information, indicates how much information about a parameter is available in a given experiment. The larger the value, the more curved is the log likelihood function and the easier it is to find the maximum.
Although we view the likelihood, log likelihood, score and information functions as functions of $\theta$, they are, of course, also functions of the observed data. When it is important to emphasize the dependence on the data $x$ we will write $L(\theta;x)$, $l(\theta;x)$, etc. Also, when we wish to determine the sampling properties of these functions as functions of the random variable $X$ we will write $S(\theta;X)$, $I(\theta;X)$, etc.
The Fisher or expected information (function) is the expected value of the information function, $J(\theta)=E_\theta[I(\theta;X)]$.
Suppose a random variable $X$ has a continuous probability density function $f(x;\theta)$ with parameter $\theta$. We will often observe only the value of $X$ rounded to some degree of precision (say one decimal place), in which case the actual observation is a discrete random variable. For example, suppose we observe $X=x$ correct to one decimal place. Then $P_\theta(x-0.05<X\le x+0.05)=\int_{x-0.05}^{x+0.05}f(z;\theta)\,dz\approx(0.1)f(x;\theta)$, assuming the function $f(\cdot;\theta)$ is quite smooth over the interval. More generally, if we observe $X$ rounded to the nearest $\Delta$ (assumed small), then the likelihood of the observation is approximately $\Delta f(x;\theta)$. Since the precision of the observation does not depend on the parameter, maximizing the discrete likelihood of the observation is essentially equivalent to maximizing the probability density function over the parameter. This partially justifies the use of the probability density function in the continuous case as the likelihood function.
Similarly, if we observed $n$ independent values $x_1,\ldots,x_n$ of a continuous random variable, we would maximize the likelihood $L(\theta)=\prod_{i=1}^nf(x_i;\theta)$ (or more commonly its logarithm) to obtain the maximum likelihood estimator of $\theta$.
The relative likelihood function, defined as $R(\theta)=L(\theta)/L(\hat\theta)$, is the ratio of the likelihood to its maximum value and takes on values between $0$ and $1$. It is used to rank possible parameter values according to their plausibility in the light of the data. If $R(\theta_0)=0.1$, say, then $\theta_0$ is rather an implausible parameter value, because the data are ten times more likely when $\theta=\hat\theta$ than they are when $\theta=\theta_0$. The set of $\theta$-values for which $R(\theta)\ge p$ is called a $100p\%$ likelihood region for $\theta$. When the parameter is one-dimensional and $\theta_0$ is its true value, $-2\ln R(\theta_0)$ converges in distribution, as the sample size $n\to\infty$, to a chi-squared distribution with 1 degree of freedom. More generally, the number of degrees of freedom of the limiting chi-squared distribution is the dimension of the parameter $\theta$. We can use this to construct a confidence interval for the unknown value of the parameter. For example, if $c$ is chosen to be the 0.95 quantile of the chi-squared(1) distribution ($c=3.84$), then $P(-2\ln R(\theta_0)\le3.84)\approx0.95$, so a $14.7\%$ likelihood interval ($e^{-3.84/2}\approx0.147$) is an approximate $95\%$ confidence interval for $\theta$. This seems to indicate that the confidence interval tolerates a considerable difference in the likelihood: the likelihood at a parameter value must differ from the maximum likelihood by a factor of more than about $7$ before it is excluded by a $95\%$ confidence interval or rejected by a test with level of significance 5%.
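The likelihood-interval construction can be sketched concretely. The example below computes the set $\{\lambda:\ -2\ln R(\lambda)\le3.84\}$ for a Poisson mean; the data (total count 42 over $n=20$) are illustrative assumptions, not values from the text:

```python
import math

# Sketch: an approximate 95% confidence interval for a Poisson mean as
# the set {lambda : -2 log R(lambda) <= 3.84}, i.e. relative likelihood
# R(lambda) >= exp(-1.92) ~ 0.147.  The data (total = 42, n = 20) are
# illustrative; the MLE of lambda is the sample mean.
n, total = 20, 42
mle = total / n   # lambda-hat = 2.1

def log_rel_lik(lam):
    # log R(lambda): the Poisson log likelihood depends on the data
    # only through the total count, l(lam) = total*log(lam) - n*lam + const
    return (total * math.log(lam) - n * lam) - (total * math.log(mle) - n * mle)

# Scan a grid for the region where -2 log R stays below 3.84,
# the 0.95 quantile of the chi-squared(1) distribution.
grid = [i / 1000 for i in range(1000, 4000)]
inside = [lam for lam in grid if -2 * log_rel_lik(lam) <= 3.84]
print(inside[0], inside[-1])   # approximate 95% interval for lambda
```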
Consider a continuous model with a family of probability density functions $\{f(x;\theta);\ \theta\in\Theta\}$. Suppose all of the densities are supported on a common set $A$. Then $\int_Af(x;\theta)\,dx=1$, and therefore $\int_A\frac{\partial}{\partial\theta}f(x;\theta)\,dx=\frac{\partial}{\partial\theta}\int_Af(x;\theta)\,dx=0$, provided that the integral can be interchanged with the derivative. Models that permit this interchange, and the calculation of the Fisher information, are called regular models.
Consider a statistical model with each density $f(x;\theta)$ supported by a common set $A$. Suppose $\Theta$ is an open interval in the real line and $f(x;\theta)>0$ for all $x\in A$ and $\theta\in\Theta$. Suppose in addition
(i) $f(x;\theta)$ is a continuous, three times differentiable function of $\theta$ for all $x\in A$;
(ii) $\left|\frac{\partial^3}{\partial\theta^3}\ln f(x;\theta)\right|\le M(x)$ for some function $M$ satisfying $E_\theta[M(X)]<\infty$;
(iii) $0<J(\theta)<\infty$.
Then we call this a regular family of distributions, or a regular model. Similarly, if these conditions hold with $X$ a discrete random variable and the integrals replaced by sums, the family is also called regular. Conditions like these, permitting the interchange of expected values and derivatives, are sometimes referred to as the Cramér conditions. In general, they are used to justify passage of a derivative under an integral.
If $X_1,\ldots,X_n$ is a random sample from a regular model, then $E_\theta[S(\theta;X)]=0$ and $\mathrm{var}_\theta(S(\theta;X))=E_\theta[I(\theta;X)]=J(\theta)$.
The case of several parameters is exactly analogous to the scalar parameter case. Suppose $\theta=(\theta_1,\ldots,\theta_k)$. In this case the ``parameter'' can be thought of as a column vector of $k$ scalar parameters. The score function is a $k$-dimensional column vector whose $i$-th component is the derivative of $l(\theta)$ with respect to the $i$-th component of $\theta$, that is, $S_i(\theta)=\frac{\partial}{\partial\theta_i}l(\theta)$. The observed information function is a $k\times k$ matrix whose $(i,j)$ element is $I_{ij}(\theta)=-\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}l(\theta)$. The Fisher information is a $k\times k$ matrix whose components are the component-wise expectations of the information matrix, that is, $J_{ij}(\theta)=E_\theta[I_{ij}(\theta;X)]$. The definition of a regular family of distributions is similarly extended. For a regular family of distributions, $E_\theta[S(\theta;X)]=0$ and the covariance matrix of the score function, $\mathrm{var}_\theta(S(\theta;X))$, is the Fisher information, i.e. $\mathrm{var}_\theta(S(\theta;X))=J(\theta)$.
Suppose $X$ has a regular exponential family distribution of the form $f(x;\eta)=h(x)\exp\{\sum_{i=1}^k\eta_iT_i(x)-d(\eta)\}$. Then $S_j(\eta)=T_j(x)-\frac{\partial}{\partial\eta_j}d(\eta)$ and $E_\eta[T_j(X)]=\frac{\partial}{\partial\eta_j}d(\eta)$. Therefore the maximum likelihood estimator of $\eta$ based on a random sample $X_1,\ldots,X_n$ from $f(x;\eta)$ is the solution to the equations $\frac{1}{n}\sum_{i=1}^nT_j(X_i)=E_\eta[T_j(X)]$, $j=1,\ldots,k$. The maximum likelihood estimators are obtained by setting the sample moments of the natural sufficient statistic equal to their expected values and solving.
Suppose that the maximum likelihood estimate $\hat\theta$ is determined by the likelihood equation $S(\hat\theta)=0$. It frequently happens that an analytic solution for $\hat\theta$ cannot be obtained. If we begin with an approximate value $\theta_0$ for the parameter, we may update that value as follows: $\theta_{m+1}=\theta_m+\frac{S(\theta_m)}{I(\theta_m)},\quad m=0,1,2,\ldots$, and, provided that convergence of $\theta_m$ obtains, it converges to a solution of the score equation above. In the multiparameter case, where $S(\theta)$ is a vector and $I(\theta)$ is a matrix, Newton's method becomes $\theta_{m+1}=\theta_m+I^{-1}(\theta_m)S(\theta_m)$. In both of these, we can replace the information function $I$ by the Fisher information $J$ for a similar algorithm.
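The variant that replaces the information function by the Fisher information (often called Fisher scoring) can be sketched for a model with no closed-form MLE, the Cauchy location family, where the Fisher information in a sample of size $n$ is known to be $n/2$; the data and starting value below are illustrative assumptions:

```python
import math

# Sketch: Fisher scoring (Newton's method with the information function
# replaced by the Fisher information) for the Cauchy location model,
# density f(x; theta) = 1 / (pi * (1 + (x - theta)^2)).  The Fisher
# information for a sample of size n is n/2.  The data and the starting
# value (the sample median) are illustrative assumptions.
data = [-1.7, 0.2, 0.4, 0.9, 1.3, 1.6, 8.2]

def score(theta):
    # S(theta) = sum 2 (x_i - theta) / (1 + (x_i - theta)^2)
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in data)

theta = sorted(data)[len(data) // 2]   # median as a consistent starting value
fisher_info = len(data) / 2.0          # expected information n/2

for _ in range(100):
    step = score(theta) / fisher_info  # theta_{m+1} = theta_m + S/J
    theta += step
    if abs(step) < 1e-10:
        break

print(theta, score(theta))  # the score is ~0 at the root found
```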
Suppose we consider estimating a parameter $g(\theta)$, where $\theta$ is a scalar, using an unbiased estimator $T$. Is there any limit to how well an estimator like this can behave? For unbiased estimators the answer is in the affirmative: a lower bound on the variance is given by the information inequality.
Suppose $T$ is an unbiased estimator of the parameter $g(\theta)$ in a regular statistical model $\{f(x;\theta);\ \theta\in\Theta\}$. Then
$\mathrm{var}_\theta(T)\ge\frac{[g'(\theta)]^2}{J(\theta)}.\qquad\text{(CRLB)}$
Equality holds if and only if the model is a regular exponential family with natural sufficient statistic $T(x)$.
If equality holds in (CRLB) then we call $T$ an efficient estimator of $g(\theta)$. The quantity on the right-hand side of (CRLB), $[g'(\theta)]^2/J(\theta)$, is called the Cramér-Rao lower bound (C.R.L.B.). We often express the efficiency of an unbiased estimator using the ratio of the C.R.L.B. to the variance of the estimator. Large values of the efficiency (i.e. near one) indicate that the variance of the estimator is close to the lower bound.
The special case of the information inequality that is of most interest is the unbiased estimation of the parameter $\theta$ itself, i.e. $g(\theta)=\theta$. The above inequality indicates that any unbiased estimator of $\theta$ has variance at least $1/J(\theta)$. The lower bound is achieved only when the model is a regular exponential family with natural sufficient statistic $T(x)$, so even in the exponential family, only certain parameters are such that we can find unbiased estimators which achieve the C.R.L.B., namely those that are expressible as the expected value of the natural sufficient statistic.
The right-hand side in the information inequality generalizes naturally to the multiple parameter case, in which $\theta=(\theta_1,\ldots,\theta_k)$ is a vector. In this case the Fisher information $J(\theta)$ is a $k\times k$ matrix, and if $g(\theta)$ is any real-valued function of $\theta$, its derivative $\nabla g(\theta)=\left(\frac{\partial g}{\partial\theta_1},\ldots,\frac{\partial g}{\partial\theta_k}\right)'$ is a column vector. Then, if $T$ is any unbiased estimator of $g(\theta)$ in a regular model, $\mathrm{var}_\theta(T)\ge\nabla g(\theta)'\,J^{-1}(\theta)\,\nabla g(\theta)$ for all $\theta$.
One of the more successful attempts at justifying estimators and demonstrating some form of optimality has been through large sample theory, or the asymptotic behaviour of estimators as the sample size $n\to\infty$. One of the first properties one requires is consistency of an estimator. This means that the estimator converges to the true value of the parameter as the sample size (and hence the information) approaches infinity.
Consider a sequence of estimators $T_n$, where the subscript $n$ indicates that the estimator has been obtained from data with sample size $n$. Then the sequence is said to be a consistent sequence of estimators of $g(\theta)$ if $T_n\to g(\theta)$ in probability, that is, $P_\theta(|T_n-g(\theta)|>\epsilon)\to0$ as $n\to\infty$ for all $\epsilon>0$ and all $\theta\in\Theta$.
It is worth a reminder at this point that probability density functions are used to produce probabilities and are unique only up to sets of probability zero. For example, if two probability density functions $f$ and $g$ were such that they produced the same probabilities, or the same cumulative distribution function, i.e. $\int_{-\infty}^xf(z)\,dz=\int_{-\infty}^xg(z)\,dz$ for all $x$, then we would not consider them distinct probability densities, even though $f$ and $g$ may differ at one or more values of $x$. Now when we parameterize a given statistical model using $\theta$ as the parameter, it is natural to do so in such a way that different values of the parameter lead to distinct probability density functions. This means, for example, that the cumulative distribution functions associated with these densities are distinct. Without this assumption, made in the following theorem, it would be impossible to accurately estimate the parameter, since two different parameter values could lead to the same cumulative distribution function and hence to exactly the same behaviour of the observations.
Suppose $X_1,\ldots,X_n$ is a random sample from a regular statistical model $\{f(x;\theta);\ \theta\in\Theta\}$. Assume the densities corresponding to different values of the parameter are distinct, and let $\theta_0$ denote the true value of the parameter. Then, with probability tending to $1$ as $n\to\infty$, the likelihood equation $S(\theta)=0$ has a root $\hat\theta_n$ such that $\hat\theta_n$ converges in probability to $\theta_0$ as $n\to\infty$.
The likelihood equation above does not always have a unique root. The consistency of the maximum likelihood estimator is one indication that it performs reasonably well; however, it provides no reason to prefer it to some other consistent estimator. The following result indicates that maximum likelihood estimators perform as well as any reasonable estimator can, at least in the limit as $n\to\infty$. Most of the proofs of these asymptotic results can be found in Lehmann (1991).
Suppose $X_1,\ldots,X_n$ is a random sample from a regular statistical model $\{f(x;\theta);\ \theta\in\Theta\}$. Suppose $\hat\theta_n$ is a consistent root of the likelihood equation, as in the theorem above. Let $J_1(\theta)$ be the Fisher information for a sample of size one. Then $\sqrt{n}(\hat\theta_n-\theta_0)\Rightarrow N\!\left(0,\frac{1}{J_1(\theta_0)}\right)$, where $\theta_0$ is the true value of the parameter.
This result may also be written as $\hat\theta_n\approx N\!\left(\theta_0,\frac{1}{nJ_1(\theta_0)}\right)$ for large $n$.
This theorem asserts that, at least under the regularity required, the maximum likelihood estimator is asymptotically unbiased. Moreover, the asymptotic variance of the maximum likelihood estimator approaches the Cramér-Rao lower bound for unbiased estimators. This justifies the comparison of the variance of an estimator based on a sample of size $n$ to the value $\frac{1}{nJ_1(\theta)}$, which is the asymptotic variance of the maximum likelihood estimator and also the Cramér-Rao lower bound.
It also follows that $\sqrt{n}\left(g(\hat\theta_n)-g(\theta_0)\right)\Rightarrow N\!\left(0,\frac{[g'(\theta_0)]^2}{J_1(\theta_0)}\right)$. This indicates that the asymptotic variance of any smooth function of the maximum likelihood estimator also achieves the Cramér-Rao lower bound.
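These asymptotic variance formulas can be checked by simulation. A sketch for an illustrative model not taken from the text, Exponential data with mean $\theta$, where the MLE is the sample mean with asymptotic variance $\theta^2/n$ and, by the delta method, $g(\hat\theta)=1/\hat\theta$ has asymptotic variance $1/(n\theta^2)$:

```python
import numpy as np

# Sketch: sampling-distribution check of the asymptotic variance of the
# MLE and of a function of it (delta method).  For Exponential data with
# mean theta, the MLE is the sample mean with asymptotic variance
# theta^2/n; g(theta-hat) = 1/theta-hat (the rate) has asymptotic
# variance 1/(n*theta^2).  theta = 2, n = 100 are illustrative choices.
rng = np.random.default_rng(5)
theta, n, reps = 2.0, 100, 50_000

x = rng.exponential(theta, size=(reps, n))
mle = x.mean(axis=1)

print(mle.var(), theta ** 2 / n)              # both about 0.04
print((1 / mle).var(), 1 / (n * theta ** 2))  # both about 0.0025
```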
Suppose $T_n$ is asymptotically normal with mean $g(\theta)$ and variance $\sigma_n^2(\theta)$. The asymptotic efficiency of $T_n$ is defined to be the limit of the ratio of the Cramér-Rao lower bound $\frac{[g'(\theta)]^2}{nJ_1(\theta)}$ to the variance $\sigma_n^2(\theta)$. This ratio is typically less than one, with values close to one indicating that the asymptotic efficiency is close to that of the maximum likelihood estimator.
In the case $\theta=(\theta_1,\dots,\theta_k)$, the score function is the vector of partial derivatives of the log likelihood with respect to the components of $\theta$. Therefore the likelihood equation is $k$ equations in the $k$ unknown parameters. Under regularity conditions similar to those of the univariate case, the conclusion of Theorem B9 holds, that is, the components of $\hat\theta_n$ each converge in probability to the corresponding component of $\theta_0$. Similarly, the asymptotic normality remains valid in this case with little modification. Let $I(\theta)$ be the Fisher information matrix for a sample of size one and assume it is a non-singular matrix. Then $\sqrt{n}\,(\hat\theta_n-\theta_0)\to N_k\!\left(0,\ I^{-1}(\theta_0)\right)$ in distribution, where the multivariate normal distribution with $k$-dimensional mean vector $\mu$ and $k\times k$ covariance matrix $\Sigma$, denoted $N_k(\mu,\Sigma)$, has probability density function defined on $\mathbb{R}^k$ by $f(x)=(2\pi)^{-k/2}\,|\Sigma|^{-1/2}\exp\!\left(-\tfrac12 (x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$.
It also follows that $\sqrt{n}\,\big(g(\hat\theta_n)-g(\theta_0)\big)\to N\!\left(0,\ \nabla g^{\top} I^{-1}(\theta_0)\,\nabla g\right)$ in distribution, where $\nabla g=\left(\frac{\partial g}{\partial\theta_1},\dots,\frac{\partial g}{\partial\theta_k}\right)^{\top}$. Once again the asymptotic variance-covariance matrix is identical to the lower bound given by the multiparameter case of the Information Inequality.
Joint confidence regions can be constructed based on either of the asymptotic results above. Confidence intervals for a single parameter, say $\theta_i$, can be based on the approximate normality of $\hat\theta_i$, with approximate variance $[I^{-1}(\hat\theta)]_{ii}/n$, where $\hat\theta_i$ is the $i$th entry in the vector $\hat\theta$ and $[I^{-1}(\hat\theta)]_{ii}$ is the $(i,i)$ entry in the matrix $I^{-1}(\hat\theta)$.
Suppose we observe two independent random variables having normal distributions with the same variance and with means depending on a vector parameter. In this case, although the means depend on the parameter, the value of this vector parameter is unidentifiable in the sense that, for some pairs of distinct parameter values, the probability density functions of the observations are identical: two distinct parameters can lead to exactly the same joint distribution of the observations. In this case, we might consider only the two mean parameters, and anything derivable from this pair, estimable, while parameters that cannot be obtained as functions of this pair are consequently unidentifiable. The solution to the original identifiability problem is a reparametrization to a new parameter, and in general, unidentifiability usually means one should seek a new, more parsimonious parametrization.
In the above example, compute the Fisher information matrix for the parameter, and notice that the Fisher information matrix is singular. This means that if you were to attempt to compute the asymptotic variance of the maximum likelihood estimator by inverting the Fisher information matrix, the inversion would be impossible. Attempting to invert a singular matrix is like attempting to invert the number 0: it results in one or more components that you can consider to be infinite. Arguing intuitively, the asymptotic variance of the maximum likelihood estimator of some of the parameters is infinite, which is an indication that asymptotically, at least, some of the parameters may not be identifiable. When parameters are unidentifiable, the Fisher information matrix is generally singular. Conversely, when $I(\theta)$ is singular for all values of $\theta$, this may or may not mean parameters are unidentifiable for finite sample sizes, but it does usually mean one should take a careful look at the parameters with a possible view to adopting another parametrization.
Which of the two main types of estimators should we use? There is no general consensus among statisticians.
If we are estimating the expectation of a natural sufficient statistic in a regular exponential family, both maximum likelihood and unbiasedness considerations lead to the use of the sample mean of that statistic as an estimator.
When sample sizes are large, U.M.V.U.E.'s and maximum likelihood estimators are essentially the same, and the choice between them is governed by ease of computation. Unfortunately, how large ``large'' needs to be is usually unknown. Some studies have compared the behaviour of U.M.V.U.E.'s and maximum likelihood estimators for various small fixed sample sizes. The results are, as might be expected, inconclusive.
Maximum likelihood estimators exist ``more frequently'', and when they do they are usually easier to compute than U.M.V.U.E.'s. This is essentially because of the appealing invariance property of maximum likelihood estimators.
Simple examples are known for which maximum likelihood estimators behave badly even for large samples. This is more often the case when there is a large number of parameters, some of which, termed ``nuisance parameters'' are of no direct interest, but complicate the estimation.
U.M.V.U.E.'s and maximum likelihood estimators are not necessarily robust. A small change in the underlying distribution or the data could result in a large change in the estimator.
The problem of finding best unbiased estimators is considerably simpler if we limit the class in which we search. If we permit any function of the data, then we usually require the heavy machinery of complete sufficiency to produce U.M.V.U.E.'s. However, the situation is much simpler if we suggest some initial random variables and then require that our estimator be a linear combination of these. Suppose, for example, we have random variables whose expectations are linear in the parameter of interest and whose variances are known up to another scalar parameter. Which linear combinations of these variables provide an unbiased estimator of the parameter of interest, and among these possible linear combinations, which one has the smallest possible variance? To answer these questions, we need to know the covariances, at least up to some scalar multiple. We can write the model in a form reminiscent of linear regression as $Y=X\theta+\varepsilon$, where the $\varepsilon_i$'s are uncorrelated random variables with mean $0$ and variances known up to a scalar multiple. Then the linear combination of the components of $Y$ that has the smallest variance among all linear unbiased estimators of $\theta$ is given by the usual generalized regression formula $\hat\theta=(X^{\top}V^{-1}X)^{-1}X^{\top}V^{-1}Y$, where $V$ is the covariance matrix of $Y$ up to a scalar multiple. In the above example, we may compute the Fisher information matrix for the parameter as follows.
The log likelihood can be written down directly, and the Fisher information is the covariance matrix of the score vector. Notice that the Fisher information matrix is, in this case, singular. If you were to attempt to compute the asymptotic variance of the maximum likelihood estimator by inverting this information matrix, the inversion would be impossible. Attempting to invert a singular matrix is like attempting to invert the number 0: one or more components of the inverse can be taken to be infinite, indicating that, asymptotically at least, one or more of the parameters is unidentifiable.
More generally, we wish to consider a number of possibly dependent random variables $Y_1,\dots,Y_n$ whose expectations may be related to a parameter $\beta$. These may, for example, be individual observations or a number of competing estimators constructed from these observations. We assume $Y=(Y_1,\dots,Y_n)^{\top}$ has expectation given by $E(Y)=X\beta$, where $X$ is some known $n\times p$ matrix having rank $p$ and $\beta$ is a $p$-dimensional vector of unknown parameters. As in multiple regression, the matrix $X$ is known and non-random. Suppose the covariance matrix of $Y$ is $\sigma^2 V$, with $V$ a known non-singular matrix and $\sigma^2$ a possibly unknown scalar parameter. We wish to estimate a linear combination of the components of $\beta$, say $a^{\top}\beta$, where $a$ is a known $p$-dimensional column vector. We restrict our attention to unbiased estimators of $a^{\top}\beta$.
Theorem B11: Gauss-Markov Theorem
Suppose $Y$ is a random vector with mean $X\beta$ and covariance matrix $\sigma^2 V$, where the matrices $X$ and $V$ are known and the parameters $\beta$ and $\sigma^2$ are unknown. Suppose we wish to estimate a linear combination $a^{\top}\beta$ of the components of $\beta$. Then among all linear combinations of the components of $Y$ which are unbiased estimators of the parameter $a^{\top}\beta$, the estimator $a^{\top}\hat\beta$, with $\hat\beta=(X^{\top}V^{-1}X)^{-1}X^{\top}V^{-1}Y$, has the smallest variance.
Note that this result does not depend on any assumed normality of the components of $Y$, but only on the first and second moment behaviour, that is, the mean and the covariances. In the special case when $V$ is the identity matrix, $\hat\beta$ is the ordinary least squares estimator.
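As a minimal sketch of the Gauss-Markov estimator, consider the scalar case $Y_i=\theta x_i+\varepsilon_i$ with uncorrelated errors of known relative variances $v_i$ (so $V$ is diagonal); the formula of the theorem reduces to a weighted least squares estimate. The numbers below are illustrative, not from the text:

```python
# Gauss-Markov (generalized least squares) estimator in the simplest case:
# Y_i = theta * x_i + eps_i with uncorrelated errors of variance
# sigma^2 * v_i, v_i known.  The BLUE of theta reduces to
# theta_hat = sum(x_i*y_i/v_i) / sum(x_i^2/v_i).

def gls_scalar(x, y, v):
    num = sum(xi * yi / vi for xi, yi, vi in zip(x, y, v))
    den = sum(xi * xi / vi for xi, vi in zip(x, v))
    return num / den

# With exact (noise-free) data the estimator recovers theta exactly, and it
# is unbiased for any choice of weights v_i > 0.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]      # generated with theta = 2 exactly
v = [1.0, 4.0, 9.0, 16.0]     # unequal known variances
theta_hat = gls_scalar(x, y, v)
```

Replacing `v` by equal weights gives the ordinary least squares special case mentioned above.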
To find the maximum likelihood estimator, we usually solve the likelihood equation $S(\theta)=\ell'(\theta)=0$. Note that the function on the left hand side is a function of both the observations and the parameter. Such a function is called an estimating function. Most sensible estimators, like the maximum likelihood estimator, can be described easily through an estimating function. For example, if we know only the mean and variance structure of independent identically distributed observations, we can still write down an estimating function $g(X;\theta)$ for the parameter $\theta$ without any other knowledge of the distribution, its density, mean, etc. The estimating function is set equal to $0$ and solved for $\theta$. Such a function is an unbiased estimating function in the sense that $E_\theta\,g(X;\theta)=0$ for all $\theta$; this allows us to conclude that the function is at least centered appropriately for the estimation of the parameter $\theta$. Now suppose that $g(X;\theta)$ is an unbiased estimating function corresponding to a large sample. Often it can be written as a sum of independent components, for example $g(X;\theta)=\sum_{i=1}^{n}g_i(X_i;\theta)$ (B3.5). Now suppose $\hat\theta$ is a root of the estimating equation $g(X;\theta)=0$. Then for $\hat\theta$ sufficiently close to the true value $\theta_0$, a Taylor expansion gives $0=g(X;\hat\theta)\approx g(X;\theta_0)+(\hat\theta-\theta_0)\,\frac{\partial g}{\partial\theta}(X;\theta_0)$. Using the Central Limit Theorem, assuming that $\theta_0$ is the true value of the parameter and provided $g$ is a sum as in (B3.5), the first term is approximately normal with mean $0$ and variance $\mathrm{var}_{\theta_0}\,g(X;\theta_0)$. The derivative term is also a sum of similar derivatives of the individual $g_i$. If a law of large numbers applies to these terms, then, when divided by $n$, this sum will be asymptotically equivalent to its expectation divided by $n$. It follows that the root $\hat\theta$ will have an approximate normal distribution with mean $\theta_0$ and variance $\mathrm{var}_{\theta_0}\,g(X;\theta_0)\big/\left[E_{\theta_0}\frac{\partial}{\partial\theta}g(X;\theta_0)\right]^2$. By analogy with the relation between the asymptotic variance of the maximum likelihood estimator and the Fisher information, we call the reciprocal of the above asymptotic variance formula the Godambe information of the estimating function. This information measure is $J(\theta)=\left[E_\theta\frac{\partial}{\partial\theta}g(X;\theta)\right]^2\big/\mathrm{var}_\theta\,g(X;\theta)$ (B3.6). Godambe (1960) proved the following result.
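A small sketch of the estimating-function recipe, with data and names of my own choosing: for the unbiased estimating function $g(\theta)=\sum_i(X_i-\theta)$, whose root is the sample mean, we solve $g(\theta)=0$ numerically, mimicking the general case in which no closed-form root exists:

```python
import random

# Estimation via an unbiased estimating function: if E(X_i) = theta, then
# g(theta) = sum(X_i - theta) satisfies E_theta g(theta) = 0 and its root
# is the sample mean.  We solve g(theta) = 0 by bisection to mimic the
# general recipe, even though here the root has a closed form.

def solve_estimating_equation(g, lo, hi, tol=1e-10):
    # simple bisection; assumes g is decreasing in theta with a sign change
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

rng = random.Random(7)
data = [rng.gauss(3.0, 1.0) for _ in range(500)]   # illustrative sample
g = lambda theta: sum(x - theta for x in data)
root = solve_estimating_equation(g, -100.0, 100.0)
sample_mean = sum(data) / len(data)
```

The numerical root agrees with the sample mean, and by the argument above it is approximately normal about the true mean.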
Among all unbiased estimating functions satisfying the usual regularity conditions, an estimating function which maximizes the Godambe information (B3.6) is of the form $a(\theta)\,S(\theta)$, where $a(\theta)$ is non-random and $S(\theta)$ is the score function.
There are two major schools of thought on the way in which statistical inference is conducted, the frequentist and the Bayesian school. Typically, these schools differ slightly on the actual methodology and the conclusions that are reached, but more substantially on the philosophy underlying the treatment of parameters. So far we have considered a parameter as an unknown constant underlying or indexing the probability density function of the data. It is only the data, and statistics derived from the data that are random.
The Bayesian begins with the assertion that the parameter obtains as the realization of some larger random experiment. The parameter $\theta$ is assumed to have been generated according to some distribution, the prior distribution $\pi(\theta)$, and the observations are then obtained from the corresponding probability density function, interpreted as the conditional probability density of the data given the value of $\theta$. The prior distribution quantifies information about $\theta$ prior to any further data being gathered. Sometimes the prior can be constructed on the basis of past data. For example, if a quality inspection program has been running for some time, the distribution of the number of defectives in past batches can be used as the prior distribution for the number of defectives in a future batch. The prior can also be chosen to incorporate subjective information based on an expert's experience and personal judgement. The purpose of the data is then to adjust this distribution for $\theta$ in the light of the data, resulting in the posterior distribution for the parameter. Any conclusions about the plausible value of the parameter are to be drawn from the posterior distribution. For a frequentist, statements like $P(a<\theta<b)$ are meaningless; all randomness lies in the data, and the parameter is an unknown constant. Frequentists are careful to assure students that if an observed 95% confidence interval for the parameter is $(a,b)$, this does not imply $P(a<\theta<b)=0.95$. However, a Bayesian will happily quote such a probability, usually conditionally on some observations. In spite of some distance in the philosophy regarding the (random?) nature of statistical parameters, the two paradigms tend to largely agree for large sample sizes, because the prior assumptions of the Bayesian tend to be a small contributor to the conclusion.
Suppose the parameter $\theta$ is initially chosen at random according to the prior distribution $\pi(\theta)$ and then, given the value of the parameter, the observations are independent identically distributed, each with conditional probability (density) function $f(x;\theta)$. Then the posterior distribution of the parameter is the conditional distribution of $\theta$ given the data, $\pi(\theta\mid x)\propto \pi(\theta)\,L(\theta)$, where the constant of proportionality is independent of $\theta$ and $L(\theta)=\prod_i f(x_i;\theta)$ is the likelihood function. Since Bayesian inference is based on the posterior distribution, it depends on the data only through the likelihood function.
Suppose a coin is tossed $n$ times with probability of heads $p$. It is known from my ``very considerable previous experience with coins'' that the probability of heads is not always identically one half but follows a Beta distribution. If the tosses result in $x$ heads, we wish to find the posterior density function for $p$. In this case the prior distribution for the parameter is the Beta(10,10) distribution, with probability density function proportional to $p^{9}(1-p)^{9}$ for $0<p<1$. The posterior distribution of $p$ is therefore proportional to $p^{9+x}(1-p)^{9+n-x}$, where the constant may depend on $x$ and $n$ but does not depend on $p$. Therefore the posterior distribution is also a Beta distribution, but with parameters $10+x$ and $10+n-x$. Notice that the posterior mean is the expected value of this Beta distribution, $\frac{10+x}{20+n}$, which, for $x$ and $n$ sufficiently large, is reasonably close to the usual estimator $x/n$.
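The Beta-Binomial update in this example is a one-line computation; the sketch below uses illustrative counts ($x=60$ heads in $n=100$ tosses) with the Beta(10,10) prior:

```python
# Beta-Binomial update for the coin example: with a Beta(10, 10) prior and
# x heads in n tosses, the posterior is Beta(10 + x, 10 + n - x), whose
# mean (10 + x)/(20 + n) approaches x/n as n grows.  Counts illustrative.

def posterior_params(a, b, x, n):
    return a + x, b + n - x

def beta_mean(a, b):
    return a / (a + b)

a_post, b_post = posterior_params(10, 10, x=60, n=100)
post_mean = beta_mean(a_post, b_post)    # (10 + 60) / (20 + 100)
freq_est = 60 / 100                      # the usual estimator x/n
```

With these counts the posterior mean is $70/120\approx 0.583$, pulled slightly toward the prior mean $0.5$ relative to $x/n=0.6$.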
If a prior distribution has the property that the posterior distribution is in the same family of distributions as the prior then the prior is called a conjugate prior.
Suppose is a random sample from the exponential family and is assumed to have the prior distribution with parameters given by where Then the posterior distribution of , given the data is easily seen to be given by where Notice that the posterior distribution is in the same family of distributions as (3.8) and thus is a conjugate prior. The value of the parameters of the posterior distribution reflect the choice of parameters in the prior.
To find the conjugate prior for a random sample from the beta distribution, we begin by writing its probability density function in exponential family form. Then the conjugate prior distribution is the joint probability density function on the parameters which is proportional to the same exponential family expression, for suitable prior parameters. The posterior distribution takes the same form as the prior above, but with the prior parameters updated by the data. Bayesians are sometimes criticised for allowing their subjective opinions (in this case leading to the choice of the prior parameters) to influence the resulting inference, but notice that in this case, and more generally, as the sample size grows, the value of the parameters of the posterior distribution is mostly determined by the components which grow with $n$, eventually washing out the influence of the choice of prior parameters.
The choice of the prior distribution to be the conjugate prior is often motivated by mathematical convenience. However, a Bayesian would also like the prior to accurately represent the preliminary uncertainty about the plausible values of the parameter, and this may not be easily translated into one of the conjugate prior distributions. Noninformative priors are the usual way of representing ignorance about and they are frequently used in practice. It can be argued that they are more objective than a subjectively assessed prior distribution since the latter may contain personal bias as well as background knowledge. Also, in some applications the amount of prior information available is far less than the information contained in the data. In this case there seems little point in worrying about a precise specification of the prior distribution.
In the coin tossing example above, we assumed a Beta(10,10) prior distribution for the probability of heads. If there were no reason to prefer one value of $p$ over any other, then a noninformative or `flat' prior distribution that could be used is the Uniform$[0,1]$ distribution, also, as it turns out, a special case of the Beta distribution. Ignorance may not be bliss, but for Bayesians it is most often uniformly distributed. For estimating the mean $\mu$ of a normal distribution, the possible values for $\mu$ are the whole real line. If we take the prior distribution to be uniform on the real line, then this is not a proper probability density, since a constant density cannot integrate to one over an infinite interval. Prior densities of this type are called improper priors. In this case we could consider a sequence of prior distributions, such as the Uniform$[-c,c]$ distributions, which approximate this prior as $c\to\infty$. Suppose we call such a limiting prior density $\pi(\mu)$. Then the posterior distribution of the parameter is proportional to $\pi(\mu)L(\mu)$, and it is easy to see that in the limit this approaches a constant multiple of the likelihood function $L(\mu)$. For reasonably large sample sizes, $L(\mu)$ is often integrable and can therefore be normalized to produce a proper posterior distribution, even though the corresponding prior was improper. This Bayesian development provides an alternative interpretation of the likelihood function: we can consider it as proportional to the posterior distribution of the parameter when using a uniform improper prior on the whole real line. The language is somewhat sloppy here since, as we have seen, the uniform distribution on the whole real line really makes sense only through taking limits of uniform distributions on finite intervals.
In the case of a scale parameter, which must take positive values such as the normal variance, it is usual to express ignorance of the prior distribution of the parameter by assuming that the logarithm of the parameter is uniform on the real line.
One possible difficulty with using noninformative prior distributions is whether the prior distribution should be uniform for $\theta$ itself or for some function of $\theta$, such as $\theta^2$ or $\log\theta$. The objective when we used a uniform prior for a probability was to add no more information about the parameter around one possible value than around any other, and so it makes sense to use a uniform prior for a parametrization that essentially has uniform information attached to it. For this reason, it is common to use a uniform prior for $\eta=g(\theta)$, where $g$ is the function of $\theta$ whose Fisher information is constant. This idea is due to Jeffreys and leads to a prior distribution for $\theta$ which is proportional to $\sqrt{I(\theta)}$. Such a prior is referred to as a Jeffreys' prior. The reparametrization which leads to a Jeffreys' prior can be carried out as follows: suppose $\{f(x;\theta)\}$ is a regular model and $I(\theta)$ is the Fisher information for a single observation. Choose an arbitrary value $\theta^{*}$ and define the reparametrization $\eta=g(\theta)=\int_{\theta^{*}}^{\theta}\sqrt{I(t)}\,dt$ (3.9). In this case the Fisher information for the parameter $\eta$ equals one for all values of $\eta$, and so the Jeffreys' prior corresponds to using a uniform prior distribution on the values of $\eta$. Since the asymptotic variance of the maximum likelihood estimator of $\eta$ is $1/n$, which does not depend on $\eta$, (3.9) is often called a variance stabilizing transformation.
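A Monte Carlo illustration (my own sketch) of variance stabilization in the Poisson model: since $I(\lambda)=1/\lambda$, the transformation (3.9) is proportional to $g(\lambda)=2\sqrt{\lambda}$, and the variance of $2\sqrt{\bar X}$ is approximately $1/n$ whatever the value of $\lambda$:

```python
import math
import random

# Check that g(lam) = 2*sqrt(lam) stabilizes the variance in the Poisson
# model: var(2*sqrt(Xbar)) is approximately 1/n for any lam, because
# I(lam) = 1/lam and g'(lam) = 1/sqrt(lam).  Values illustrative.

def poisson_sample(lam, rng):
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def var_of_transformed_mean(lam, n=200, reps=2000, seed=2):
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        xbar = sum(poisson_sample(lam, rng) for _ in range(n)) / n
        vals.append(2.0 * math.sqrt(xbar))
    m = sum(vals) / reps
    return sum((v - m) ** 2 for v in vals) / reps

v_small = var_of_transformed_mean(2.0)   # lam = 2
v_large = var_of_transformed_mean(9.0)   # lam = 9
target = 1.0 / 200                       # the stabilized variance 1/n
```

Both simulated variances come out near $1/n$ even though the raw means $\bar X$ have variances $\lambda/n$ that differ by a factor of $4.5$.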
One method of obtaining a point estimator of is to use the posterior distribution and a suitable loss function.
The Bayes estimator of $\theta$ for squared error loss with respect to the prior $\pi$, given data $x$, is the mean of the posterior distribution, $\hat\theta=E(\theta\mid x)$. This estimator minimizes the posterior expected loss $E\big[(\theta-\hat\theta)^2\mid x\big]$.
Suppose $X_1,\dots,X_n$ is a random sample from a distribution with the given probability density function. Using a conjugate prior, find the Bayes estimator of the parameter for squared error loss.
We begin by identifying the conjugate prior distribution. The conjugate prior density is evidently of Gamma type, restricted to the permissible interval for the parameter, and if the prior is to be proper, the prior parameters must be chosen so that this density integrates to one. The posterior distribution then takes the same form as the prior, but with the prior parameters updated by the sample size and the sufficient statistic. The Bayes estimate of the parameter for squared error loss is the mean of this posterior distribution.
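As a concrete instance of this recipe (using an Exponential($\theta$) model with a Gamma$(a,b)$ prior chosen for illustration, not necessarily the density of the example above): the posterior is Gamma$(a+n,\ b+\sum x_i)$, and the Bayes estimator under squared error loss is the posterior mean:

```python
# Bayes estimator under squared error loss for an Exponential(theta) rate
# with conjugate Gamma(a, b) prior: the posterior is
# Gamma(a + n, b + sum(x)), with mean (a + n)/(b + sum(x)).
# Data and prior parameters below are illustrative.

def bayes_estimate_exponential(data, a, b):
    n, s = len(data), sum(data)
    return (a + n) / (b + s)

data = [0.5, 1.2, 0.8, 2.0, 0.9]           # illustrative observations
theta_bayes = bayes_estimate_exponential(data, a=2.0, b=1.0)
theta_mle = len(data) / sum(data)           # 1/xbar, for comparison
```

For a growing sample the data terms $n$ and $\sum x_i$ dominate $a$ and $b$, so the Bayes estimate approaches the maximum likelihood estimate $1/\bar x$.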
There remains, after many decades, a controversy between Bayesians and frequentists about which approach to estimation is more suitable to the real world. The Bayesian has advantages at least in the ease of interpretation of the results. For example, a Bayesian can use the posterior distribution given the data to determine points $a<b$ such that $P(a<\theta<b\mid x)=1-\alpha$ and then give a Bayesian confidence interval $(a,b)$ for the parameter. The Bayesian will state that (in a Bayesian model, subject to the validity of the prior) the conditional probability given the data that the parameter falls in the interval is $1-\alpha$. No such probability can be ascribed to a confidence interval by frequentists, who see no randomness in the parameter to which this probability statement is supposed to apply. Bayesian confidence regions are also called credible regions in order to make clear the distinction between the interpretation of Bayesian confidence regions and frequentist confidence regions.
Suppose $\pi(\theta\mid x)$ is the posterior distribution of $\theta$ given the data and $C(x)$ is a subset of the parameter space. If $P(\theta\in C(x)\mid x)=1-\alpha$, then $C(x)$ is called a $100(1-\alpha)\%$ credible region for $\theta$. A credible region can be formed in many ways. If $C(x)=[a,b]$ is an interval such that $P(\theta<a\mid x)=P(\theta>b\mid x)=\alpha/2$, then $C(x)$ is called an equal-tailed credible region. A highest posterior density (H.P.D.) credible region is constructed in a manner similar to likelihood regions: the highest posterior density credible region is given by $C(x)=\{\theta:\pi(\theta\mid x)\ge c\}$, where $c$ is chosen such that $P(\theta\in C(x)\mid x)=1-\alpha$. A highest posterior density credible region is optimal in the sense that it is the shortest credible interval for a given value of $\alpha$.
Suppose $X_1,\dots,X_n$ is a random sample from the N$(\mu,\sigma^2)$ distribution, where $\sigma^2$ is known and $\mu$ has the conjugate prior. Find the 0.95 H.P.D. credible region for $\mu$, and compare this to a 95% C.I. for $\mu$. Suppose the prior distribution for $\mu$ is N$(\mu_0,\tau^2)$, so the prior density is given by $\pi(\mu)\propto\exp\!\left(-\frac{(\mu-\mu_0)^2}{2\tau^2}\right)$ and the posterior density by $\pi(\mu\mid x)\propto\exp\!\left(-\frac{(\mu-\tilde\mu)^2}{2\tilde\sigma^2}\right)$, where the constants $\tilde\mu$ and $\tilde\sigma^2$ depend on the data but not on $\mu$, and where $\tilde\mu=w\bar{x}+(1-w)\mu_0$, $w=\dfrac{n\tau^2}{n\tau^2+\sigma^2}$, $\tilde\sigma^2=\dfrac{\sigma^2\tau^2}{n\tau^2+\sigma^2}$. Therefore the posterior distribution of $\mu$ is N$(\tilde\mu,\tilde\sigma^2)$. It follows that the 0.95 H.P.D. credible region is of the form $\tilde\mu\pm 1.96\,\tilde\sigma$. Notice that as $n\to\infty$ the weight $w\to 1$, and so $\tilde\mu$ is asymptotically equivalent to the sample mean $\bar{x}$. Similarly, as $n\to\infty$, $\tilde\sigma^2$ is asymptotically equivalent to $\sigma^2/n$. This means that for large values of $n$ the H.P.D. region is close to the region $\bar{x}\pm 1.96\,\sigma/\sqrt{n}$, and the latter is the 95% confidence interval for $\mu$ based on the normal distribution of the maximum likelihood estimator $\bar{X}$.
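The normal-normal H.P.D. computation can be sketched numerically as follows; the prior parameters $\mu_0$ and $\tau$ and the data summaries are illustrative:

```python
import math

# H.P.D. interval for a normal mean with known sigma and a conjugate
# N(mu0, tau^2) prior: the posterior is normal, so the 0.95 H.P.D. region
# is posterior mean +/- 1.96 posterior standard deviations.  As n grows it
# approaches the frequentist interval xbar +/- 1.96*sigma/sqrt(n).

def hpd_interval(xbar, n, sigma, mu0, tau):
    w = (n * tau ** 2) / (n * tau ** 2 + sigma ** 2)   # weight on xbar
    post_mean = w * xbar + (1 - w) * mu0
    post_sd = math.sqrt((sigma ** 2 * tau ** 2) / (n * tau ** 2 + sigma ** 2))
    return post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd

lo_small, hi_small = hpd_interval(xbar=1.0, n=10, sigma=1.0, mu0=0.0, tau=1.0)
lo_big, hi_big = hpd_interval(xbar=1.0, n=10000, sigma=1.0, mu0=0.0, tau=1.0)
```

For $n=10$ the interval is visibly shrunk toward the prior mean $\mu_0=0$; for $n=10000$ it nearly coincides with the frequentist interval $\bar x \pm 1.96/\sqrt{n}$.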
Finally, although statisticians argue about whether the Bayesian or the frequentist approach is better, there is really no one right way to do statistics. There is something fundamentalist about the Bayesian paradigm (though the Reverend Bayes was, as far as we know, far from a fundamentalist) in that it places all objects, parameters and data, in much the same context and treats them similarly. It is a coherent philosophy of statistics, and a Bayesian will vigorously argue that there is an inconsistency in regarding some unknowns as random and others as deterministic. There are certainly instances in which a Bayesian approach seems more sensible, particularly if the parameter is a measurement on a possibly randomly chosen individual (say the expected total annual claim of a client of an insurance company).
Statistical estimation usually concerns the estimation of the value of a parameter when we know little about it except perhaps that it lies in a given parameter space, and when we have no a priori reason to prefer one value of the parameter over another. If, however, we are asked to decide between two possible values of the parameter, the consequences of one choice of the parameter value may be quite different from another choice. For example, if we believe $Y$ is normally distributed with mean $\beta x$ and variance $\sigma^2$ for some explanatory variable $x$, then the value $\beta=0$ means there is no relation between $Y$ and $x$: we need neither collect the values of $x$ nor build a model around them. Thus the two choices $\beta=0$ and $\beta\ne 0$ are quite different in their consequences. This is often the case.
A hypothesis test involves a (usually natural) separation of the parameter space $\Omega$ into two disjoint regions, $\Omega_0$ and $\Omega_1=\Omega\setminus\Omega_0$ (by the difference between two sets we mean those points in the former that are not in the latter). This partition of the parameter space corresponds to testing the null hypothesis that the parameter is in $\Omega_0$. We usually write this hypothesis in the form $H_0:\theta\in\Omega_0$. The null hypothesis is usually the status quo. For example, in a test of a new drug, the null hypothesis would be that the drug has no effect, or no more of an effect than drugs already on the market. The null hypothesis is rejected only if there is reasonably strong evidence against it. The alternative hypothesis determines what departures from the null hypothesis are anticipated; in this case, it might be simply $H_1:\theta\in\Omega_1$. Since we do not know the true value of the parameter, we must base our decision on the observed value of $X$. The hypothesis test is conducted by determining a partition of the sample space into two sets, the critical or rejection region $C$ and its complement, which is called the acceptance region. We declare that $H_0$ is false (in favour of the alternative) if we observe $X\in C$.
The power function of a test with critical region $C$ is the function $\beta(\theta)=P_\theta(X\in C)$, the probability that the null hypothesis is rejected, as a function of the parameter.
It is obviously desirable, in order to minimize the two types of possible errors in our decision, for the power function to be small for $\theta\in\Omega_0$ but large otherwise. The probability of rejecting the null hypothesis when it is true (type I error) is a particularly important type of error which we attempt to control. This probability determines one important measure of the performance of a test, the level of significance.
A test has level of significance $\alpha$ if $\beta(\theta)\le\alpha$ for all $\theta\in\Omega_0$.
The level of significance is simply an upper bound on the probability of a type I error. There is no assurance that the upper bound is tight, that is, that equality is achieved somewhere. The lowest such upper bound is often called the size of the test.
The size of a test is equal to $\sup_{\theta\in\Omega_0}\beta(\theta)$.
Tests are often constructed by specifying the size of the test, which in turn determines the probability of the type I error, and then attempting to minimize the probability that the null hypothesis is accepted when it is false (type II error). Equivalently, we try to maximize the power function of the test for $\theta\in\Omega_1$.
A test with power function $\beta(\theta)$ is a uniformly most powerful (U.M.P.) test of size $\alpha$ if, for all other tests of the same size having power function $\beta^{*}(\theta)$, we have $\beta(\theta)\ge\beta^{*}(\theta)$ for all $\theta\in\Omega_1$.
The word ``uniformly'' above refers to the fact that one function dominates another uniformly for all $\theta\in\Omega_1$. When the alternative consists of a single point, the construction of a best test is particularly easy; in this case, we may drop the word ``uniformly'' and refer to a ``most powerful test''. The construction of a best test, by this definition, is possible under rather special circumstances. First, we often require a simple null hypothesis. This is the case when $\Omega_0$ consists of a single point $\{\theta_0\}$ and so we are testing the null hypothesis $H_0:\theta=\theta_0$.
Theorem (Neyman-Pearson Lemma): Let $X$ have probability (density) function $f(x;\theta)$. Consider testing a simple null hypothesis $H_0:\theta=\theta_0$ against a simple alternative $H_1:\theta=\theta_1$. For a constant $c>0$, suppose the critical region defined by $C=\left\{x:\ \dfrac{f(x;\theta_1)}{f(x;\theta_0)}\ge c\right\}$ corresponds to a test of size $\alpha$. Then the test with this critical region is a most powerful test of size $\alpha$ for testing $H_0:\theta=\theta_0$ against $H_1:\theta=\theta_1$.
Proof:
Consider another critical region $D$ with the same size $\alpha$. Then $\int_C f(x;\theta_0)\,dx=\int_D f(x;\theta_0)\,dx=\alpha$, and therefore $\int_{C\setminus D} f(x;\theta_0)\,dx=\int_{D\setminus C} f(x;\theta_0)\,dx$ (4.1).
For $x\in C\setminus D$, $f(x;\theta_1)\ge c\,f(x;\theta_0)$, and thus $\int_{C\setminus D} f(x;\theta_1)\,dx\ \ge\ c\int_{C\setminus D} f(x;\theta_0)\,dx$ (4.2). For $x\in D\setminus C$, $f(x;\theta_1)<c\,f(x;\theta_0)$, and thus $\int_{D\setminus C} f(x;\theta_1)\,dx\ \le\ c\int_{D\setminus C} f(x;\theta_0)\,dx$ (4.3).
Now $\int_C f(x;\theta_1)\,dx-\int_D f(x;\theta_1)\,dx=\int_{C\setminus D} f(x;\theta_1)\,dx-\int_{D\setminus C} f(x;\theta_1)\,dx$. Therefore, using (4.1), (4.2), and (4.3), we have $\int_C f(x;\theta_1)\,dx-\int_D f(x;\theta_1)\,dx\ \ge\ c\left[\int_{C\setminus D} f(x;\theta_0)\,dx-\int_{D\setminus C} f(x;\theta_0)\,dx\right]=0$, and the test with critical region $C$ is therefore the most powerful.
Suppose we anticipate collecting daily returns from the past $n$ days of a stock, assumed to be distributed according to a Normal$(\mu\Delta,\sigma^2\Delta)$ distribution. Here $\Delta$ is the length of a day measured in years, and $\mu$ and $\sigma$ are the annual drift and volatility parameters. We wish to test whether the stock has zero or positive drift, so we test the hypothesis $H_0:\mu=0$ against the alternative $H_1:\mu>0$ at level of significance $\alpha$. We want the probability of the incorrect decision when the drift is 20% per year to be small, so let us choose it to be $\alpha$ as well, which means that when $\mu=0.2$ the power of the test should be at least $1-\alpha$. How large a sample must be taken in order to ensure this?
The test itself is easy to express: we reject the null hypothesis if $\bar{X}>z_{\alpha}\,\sigma\sqrt{\Delta}/\sqrt{n}$, where the value $z_{\alpha}$ has been chosen so that $P(Z>z_{\alpha})=\alpha$ when $Z$ has a standard normal distribution. The power of the test is the rejection probability when the parameter $\mu=0.2$, and this is $P\!\left(Z>z_{\alpha}-\mu\sqrt{n\Delta}/\sigma\right)$, where $Z$ has a standard normal distribution. Since we want the power to be $1-\alpha$, the sample size must be chosen so that $n\ \ge\ \dfrac{(2z_{\alpha})^2\,\sigma^2}{\mu^2\,\Delta}$. Now if we try some reasonable values for the parameters, the required $n$ turns out to be about 55 years' worth of data, far larger a sample than we could hope to collect. This example shows that the typical variabilities in the market are so large, compared with even fairly high rates of return, that it is almost impossible to distinguish between theoretical rates of return of 0% and 20% per annum using a hypothesis test with daily data.
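The sample-size arithmetic can be reproduced as follows. The parameter values are assumptions chosen for illustration; with an annual volatility of $\sigma=0.45$ and $\alpha=0.05$, the answer matches the roughly 55 years quoted above:

```python
import math

# Sample-size arithmetic for the one-sided drift test: daily returns are
# N(mu*dt, sigma^2*dt) with dt = 1/252 years.  For size alpha and power
# 1 - beta at drift mu1, the z-test needs
#   n >= ((z_alpha + z_beta) * sigma)^2 / (mu1^2 * dt).
# sigma = 0.45 and alpha = beta = 0.05 are assumed values.

def z_quantile(p):
    # inverse standard normal CDF via bisection on the CDF built from erf
    lo, hi = -10.0, 10.0
    while hi - lo > 1e-12:
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def required_days(mu1, sigma, alpha=0.05, beta=0.05, dt=1.0 / 252):
    z = z_quantile(1 - alpha) + z_quantile(1 - beta)
    return math.ceil((z * sigma) ** 2 / (mu1 ** 2 * dt))

n = required_days(mu1=0.20, sigma=0.45)   # drift 20%/yr, volatility 45%/yr
years = n / 252
```

Lowering the volatility assumption shortens the required record, but even at $\sigma=0.30$ the answer is still over two decades of daily data.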
There is a close relationship between hypothesis tests and confidence intervals, as the following example illustrates. Suppose $X_1,\dots,X_n$ is a random sample from the N$(\mu,1)$ distribution and we wish to test the hypothesis $H_0:\mu=\mu_0$ against $H_1:\mu\ne\mu_0$. The critical region $C=\{x:\ |\bar{x}-\mu_0|\ge 1.96/\sqrt{n}\}$ is a size $0.05$ critical region, which has the corresponding acceptance region $\{x:\ |\bar{x}-\mu_0|<1.96/\sqrt{n}\}$. Note that the hypothesis would not be rejected at the $0.05$ level if $|\bar{x}-\mu_0|<1.96/\sqrt{n}$, or equivalently $\mu_0\in\left(\bar{x}-1.96/\sqrt{n},\ \bar{x}+1.96/\sqrt{n}\right)$, which is a 95% C.I. for $\mu$.
Let be a random sample from the Gamma distribution. Show that is a size critical region for testing . Show how this critical region may be used to construct a C.I. for
Consider a test of the hypothesis $H_0:\theta\in\Omega_0$ against $H_1:\theta\in\Omega_1$. We have seen that for a prescribed size, the most powerful test of a simple null hypothesis against a simple alternative is based on the likelihood ratio $L(\theta_1)/L(\theta_0)$; by the Neyman-Pearson Lemma it has critical region $\{x:\ L(\theta_1)/L(\theta_0)\ge c\}$, where $c$ is a constant determined by the size of the test. When either the null or the alternative hypothesis is composite (i.e. contains more than one point) and there is no uniformly most powerful test, it seems reasonable to use a test with a critical region of the same form for some choice of $c$. The likelihood ratio test does this with $\theta_1$ replaced by $\hat\theta$, the maximum likelihood estimator over all possible values of the parameter, and $\theta_0$ replaced by $\hat\theta_0$, the maximum likelihood estimator of the parameter when it is restricted to $\Omega_0$. Thus, the likelihood ratio test has critical region $\{x:\ \Lambda(x)\ge c\}$, where $\Lambda(x)=L(\hat\theta)/L(\hat\theta_0)$ and $c$ is determined by the size of the test. In general, the distribution of the test statistic $\Lambda$ may be difficult to find. Fortunately, however, its asymptotic distribution is known under fairly general conditions. In a few cases, we can show that the likelihood ratio test is equivalent to the use of a statistic with known distribution. However, in many cases, we need to rely on the asymptotic chi-squared distribution of Theorem 4.4.6.
Let $X_1,\dots,X_n$ be a random sample from the N$(\mu,\sigma^2)$ distribution, where $\mu$ and $\sigma^2$ are unknown. Consider a test of $H_0:\mu=\mu_0$ against the alternative $H_1:\mu\ne\mu_0$. We can show that the likelihood ratio test of $H_0$ against $H_1$ has a critical region expressible in terms of the usual one-sample test statistic; under $H_0$ that statistic has an F distribution, and we can thus find a size $\alpha$ test for any $\alpha$.
Suppose $X_1,\dots,X_n$ is a random sample from a regular statistical model $\{f(x;\theta);\ \theta\in\Omega\}$, with $\Omega$ an open set in $k$-dimensional Euclidean space. Consider a subset $\Omega_0$ of $\Omega$ determined by an open subset of $q$-dimensional Euclidean space, $q<k$. Then the likelihood ratio statistic $\Lambda$ defined above is such that, under the hypothesis $H_0:\theta\in\Omega_0$, $2\log\Lambda\to\chi^2_{k-q}$ in distribution as $n\to\infty$. Note: the number of degrees of freedom, $k-q$, is the difference between the number of parameters that need to be estimated in the general model and the number of parameters left to be estimated under the restrictions imposed by $H_0$.
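A sketch of the likelihood ratio statistic in a one-parameter case: for a Poisson sample and $H_0:\lambda=\lambda_0$, the MLE is $\bar x$ and $2\log\Lambda=2n\big(\lambda_0-\bar x+\bar x\log(\bar x/\lambda_0)\big)$, which is compared to a $\chi^2_1$ critical value (one free parameter in the full model, none under $H_0$). The counts are illustrative:

```python
import math

# Likelihood ratio statistic for H0: lam = lam0 in a Poisson model, where
# the unrestricted MLE is xbar:
#   2*log(Lambda) = 2n*(lam0 - xbar + xbar*log(xbar/lam0)).
# It is compared with 3.841, the 0.95 quantile of chi-squared(1).

def two_log_lambda(data, lam0):
    n = len(data)
    xbar = sum(data) / n
    return 2.0 * n * (lam0 - xbar + xbar * math.log(xbar / lam0))

data = [3, 5, 4, 6, 2, 5, 4, 3, 7, 6]   # illustrative counts, xbar = 4.5
stat = two_log_lambda(data, lam0=3.0)
reject = stat > 3.841                    # chi-squared(1), size 0.05
```

The statistic is zero when $\lambda_0=\bar x$ (the null value fits perfectly) and grows as $\lambda_0$ moves away from $\bar x$.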
We have seen that a test of hypothesis is a rule which allows us to decide whether to accept the null hypothesis or to reject it in favour of the alternative hypothesis based on the observed data. In situations in which $H_1$ is difficult to specify, a test of significance could be used. A (pure) test of significance is a procedure for measuring the strength of the evidence provided by the observed data against $H_0$. This method usually involves looking at the distribution of a test statistic or discrepancy measure under $H_0$. The p-value or significance level for the test is the probability, computed under $H_0$, of observing a value of the statistic at least as extreme as the value observed. The smaller the observed p-value, the stronger the evidence against $H_0$. The difficulty with this approach is how to find a statistic with `good properties'; the likelihood ratio statistic provides a general test statistic which may be used.
Score tests can be viewed as a more general class of tests of $H_0:\theta=\theta_0$ against $H_1:\theta\ne\theta_0$ which tend to have considerable power provided that the values of the parameter under the null and the alternative are close. If the usual regularity conditions hold, then under $H_0$ we have $\dfrac{S(\theta_0)}{\sqrt{n I(\theta_0)}}\to N(0,1)$ and thus $\dfrac{S^2(\theta_0)}{n I(\theta_0)}\to\chi^2_1$ in distribution, where $S(\theta)$ is the score function. For a vector parameter we have $S(\theta_0)^{\top}\,[n I(\theta_0)]^{-1}\,S(\theta_0)\to\chi^2_k$. The test based on this statistic is called a (Rao) score test. It has critical region determined by the size of the test, using the appropriate chi-squared quantile, and it is asymptotically equivalent to the likelihood ratio test.
Suppose that $\hat\theta$ is the maximum likelihood estimator of $\theta$ over all of $\Omega$ and we wish to test $H_0:\theta=\theta_0$ against $H_1:\theta\ne\theta_0$. If the usual regularity conditions hold, then under $H_0$, $n(\hat\theta-\theta_0)^2\,I(\hat\theta)\to\chi^2_1$ in distribution. A test based on this test statistic is called a Wald test. It has critical region determined by the size of the test, again using a chi-squared quantile. Both the score test and the Wald test are asymptotically equivalent to the likelihood ratio test, and the intuitive explanation for these equivalences is quite simple. For large values of the sample size, the maximum likelihood estimator is close to the true value of the parameter, and so the log likelihood can be approximated by the first two terms in the Taylor series expansion of $\ell(\theta_0)$ about $\hat\theta$; since $\ell'(\hat\theta)=0$, this gives $2\log\Lambda=2\big[\ell(\hat\theta)-\ell(\theta_0)\big]\approx(\hat\theta-\theta_0)^2\,J(\hat\theta)$, where $J$ is the observed information, and the observed information is asymptotically equivalent to the Fisher information $nI$. This verifies the equivalence of the likelihood ratio and the Wald tests.
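The three asymptotically equivalent statistics can be compared side by side in a Poisson setting (an illustration of mine, not from the text): with $I(\lambda)=1/\lambda$, the score statistic is $n(\bar x-\lambda_0)^2/\lambda_0$ and the Wald statistic is $n(\bar x-\lambda_0)^2/\bar x$:

```python
import math

# Score, Wald, and likelihood ratio statistics for H0: lam = lam0 in a
# Poisson model, where I(lam) = 1/lam and the MLE is xbar:
#   score = S(lam0)^2/(n*I(lam0)) = n*(xbar - lam0)^2 / lam0
#   Wald  = n*(xbar - lam0)^2 * I(xbar) = n*(xbar - lam0)^2 / xbar
#   LRT   = 2n*(lam0 - xbar + xbar*log(xbar/lam0))

def poisson_tests(data, lam0):
    n = len(data)
    xbar = sum(data) / n
    score = n * (xbar - lam0) ** 2 / lam0
    wald = n * (xbar - lam0) ** 2 / xbar
    lrt = 2.0 * n * (lam0 - xbar + xbar * math.log(xbar / lam0))
    return score, wald, lrt

data = [3, 5, 4, 6, 2, 5, 4, 3, 7, 6]   # illustrative counts, xbar = 4.5
score, wald, lrt = poisson_tests(data, lam0=4.0)
```

For this null value, close to $\bar x$, the three statistics nearly agree, as the Taylor-expansion argument above predicts; they separate when $\lambda_0$ is far from $\bar x$.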