When we return midterm tests, someone almost always asks what the average was. While we could list out all marks to give a picture of how students performed, this would be tedious. It would also give more detail than could be immediately digested. If we summarize the results by telling a class the average mark, students immediately get a sense of how well the class performed. For this reason, "summary statistics" are often more helpful than giving full details of every outcome.
To illustrate some of the ideas involved, suppose we were to observe cars crossing a toll bridge, and record the number, $X$, of people in each car. Suppose in a small study data on 25 cars were collected. We could list out all 25 numbers observed, but a more helpful way of presenting the data would be in terms of the frequency distribution below, which gives the number of times (the ``frequency'') each value of $x$ occurred.
We could also draw a frequency histogram of these frequencies:
Frequency Histogram
Frequency distributions or histograms are good summaries of data because they show the variability in the observed outcomes very clearly. Sometimes, however, we might prefer a single-number summary. The most common such summary is the average, or arithmetic mean, of the outcomes. The mean of $n$ outcomes $x_1, \dots, x_n$ for a random variable $X$ is $\frac{1}{n}\sum_{i=1}^{n} x_i$, and is denoted by $\bar{x}$. The arithmetic mean for the example above can be calculated as
$$\bar{x} = \frac{\sum_{i=1}^{25} x_i}{25} = \frac{65}{25} = 2.6.$$
That is, there was an average of 2.6 persons per car. A set of observed outcomes $x_1, \dots, x_n$ for a random variable $X$ is termed a sample in probability and statistics. To reflect the fact that this is the average for a particular sample, we refer to it as the sample mean. Unless somebody deliberately ``cooked'' the study, we would not expect to get precisely the same sample mean if we repeated it another time. Note also that $\bar{x}$ is not in general an integer, even though $X$ is.
Two other common summary statistics are the median and mode.
The median of a sample is a value such that half the results are below it and half above it, when the results are arranged in numerical order. If these 25 results were written in order, the 13th outcome would be a 2, so the median is 2. By convention, we go half way between the middle two values if there is an even number of observations.
The mode of the sample is the value which occurs most often. In this case the mode is 2. There is no guarantee there will be only a single mode.
The statistics in the preceding section summarize features of a sample of observed $x$-values. The same idea can be used to summarize the probability distribution of a random variable $X$. To illustrate, consider the previous example, where $X$ is the number of persons in a randomly selected car crossing a toll bridge. Note that we can re-arrange the expression used to calculate $\bar{x}$ for the sample, as
$$\bar{x} = \frac{1}{25}\sum_{i=1}^{25} x_i = \sum_{x} x \left(\frac{f_x}{25}\right),$$
where $f_x$ is the frequency of the value $x$ in the sample.
Now suppose we know that the probability function of $X$ is given by

| $x$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $f(x)$ | .30 | .25 | .20 | .15 | .09 | .01 |
Using the relative frequency ``definition'' of probability, if we observed a very large number of cars, the fraction (or relative frequency) of times $X = 1$ would be .30, for $X = 2$ this proportion would be .25, etc. So, in theory, (according to the probability model) we would expect the mean to be
$$1(.30) + 2(.25) + 3(.20) + 4(.15) + 5(.09) + 6(.01) = 2.51$$
if we observed an infinite number of cars. This ``theoretical'' mean is usually denoted by $\mu$ or $E(X)$, and requires us to know the distribution of $X$.
With this background we make the following mathematical definition.

The expectation (also called the mean or the expected value) of a discrete random variable $X$ with probability function $f(x)$ is
$$E(X) = \sum_{\text{all } x} x f(x).$$
The expectation of $X$ is also often denoted by the Greek letter $\mu$.
The expectation of $X$ can be thought of physically as the average of the $x$-values that would occur in an infinite series of repetitions of the process where $X$ is defined. This value not only describes one aspect of a probability distribution, but is also very important in certain types of applications. For example, if you are playing a casino game in which $X$ represents the amount you win in a single play, then $E(X)$ represents your average winnings (or losses!) per play.
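To see the long-run-average interpretation concretely, here is a short Python sketch using the probability function tabulated above (the simulation size and seed are arbitrary choices):

```python
import random

# Probability function f(x) of X (occupants per car) from the table above
f = {1: 0.30, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.09, 6: 0.01}

def expectation(pf):
    """E(X) = sum over all x of x * f(x)."""
    return sum(x * p for x, p in pf.items())

def simulated_mean(pf, n, seed=1):
    """Sample mean of n simulated observations of X."""
    rng = random.Random(seed)
    values, probs = zip(*sorted(pf.items()))
    draws = rng.choices(values, weights=probs, k=n)
    return sum(draws) / n

mu = expectation(f)                  # theoretical mean, 2.51
xbar = simulated_mean(f, 100_000)    # close to mu for large n
```

The sample mean drifts toward $E(X) = 2.51$ as the number of simulated cars grows, which is exactly the relative frequency reasoning above.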
Sometimes we may not be interested in the average value of $X$ itself, but in some function of $X$. Consider the toll bridge example once again, and suppose there is a toll which depends on the number of car occupants. For example, a toll of $1 per car plus 25 cents per occupant would produce an average toll for the 25 cars in the study of Section 7.1 equal to
$$\frac{1}{25}\sum_{i=1}^{25}(1 + .25\,x_i) = 1 + .25\,\bar{x} = 1 + .25(2.6) = \$1.65.$$
If $X$ has the theoretical probability function $f(x)$ given above, then the average value of this $(.25X + 1)$ toll would be defined in the same way, as
$$\sum_x (.25x + 1) f(x) = 1 + .25(2.51) = \$1.6275.$$
We call this the expectation of $(.25X + 1)$ and write $E(.25X + 1) = \$1.6275$.
As a further illustration, suppose a toll designed to encourage car pooling charged an amount $g(x)$ that decreases as the number $x$ of people in the car increases. This scheme would yield an average toll, in theory, of
$$\sum_x g(x) f(x);$$
that is, $\sum_x g(x) f(x)$ is the ``expectation'' of $g(X)$.
With this as background, we can now make a formal definition.

The expectation of some function $g(X)$ of a random variable $X$ with probability function $f(x)$ is
$$E[g(X)] = \sum_{\text{all } x} g(x) f(x).$$
Notes:

1. You can interpret $E[g(X)]$ as the average value of $g(X)$ in an infinite series of repetitions of the process where $X$ is defined.

2. $E[g(X)]$ is also known as the ``expected value'' of $g(X)$. This name is somewhat misleading since the average value of $g(X)$ may be a value which $g(X)$ never takes - hence unexpected!

3. The case where $g(x) = x$ reduces to our earlier definition of $E(X)$.

4. Confusion sometimes arises because we have two notations for the mean of a probability distribution: $\mu$ and $E(X)$ mean the same thing. There is nothing wrong with this.
When calculating expectations, look at your answer to be sure it makes sense. If $X$ takes values from 1 to 10, you should know you've made an error if you get a value of $E(X)$ less than 1 or greater than 10. In physical terms, $E(X)$ is the balance point for the histogram of $f(x)$.
Let us note a couple of mathematical properties of expectation that can help to simplify calculations.

Properties of Expectation:

If your linear algebra is good, it may help if you think of $E$ as being a linear operator. Otherwise you'll have to remember these and subsequent properties.

1. For constants $a$ and $b$,
$$E[a\,g(X) + b] = a\,E[g(X)] + b.$$

Proof:
$$E[a\,g(X) + b] = \sum_{\text{all } x} [a\,g(x) + b] f(x) = a \sum_{\text{all } x} g(x) f(x) + b \sum_{\text{all } x} f(x) = a\,E[g(X)] + b,$$
since $\sum_{\text{all } x} f(x) = 1$.

2. For constants $a$ and $b$ and functions $g_1$ and $g_2$, it is also easy to show that
$$E[a\,g_1(X) + b\,g_2(X)] = a\,E[g_1(X)] + b\,E[g_2(X)].$$
Don't let expectation intimidate you. Much of it is common sense. For example, property 1 says $E(b) = b$ if we let $a = 0$: the expectation of a constant $b$ is of course equal to $b$. It also says $E(aX) = a\,E(X)$ if we let $g(X) = X$ and $b = 0$. This is obvious also. Note, however, that for $g$ a nonlinear function, it is not true that $E[g(X)] = g(E(X))$; this is a common mistake. (Check this for the example above.)
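The warning that $E[g(X)] \neq g(E(X))$ for nonlinear $g$ is easy to check numerically. A small Python sketch using the toll-bridge distribution from earlier (the choice $g(x) = x^2$ is just an illustration):

```python
# Probability function of X from the toll-bridge example
f = {1: 0.30, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.09, 6: 0.01}

def E(g, pf):
    """E[g(X)] = sum over all x of g(x) * f(x)."""
    return sum(g(x) * p for x, p in pf.items())

mean = E(lambda x: x, f)                # E(X) = 2.51
linear = E(lambda x: 0.25 * x + 1, f)   # equals 0.25 * E(X) + 1, by property 1
e_x2 = E(lambda x: x * x, f)            # E(X^2), NOT equal to (E(X))^2
```

Here $E(X^2) = 8.11$ while $(E(X))^2 = 6.3001$: averaging $g(X)$ is not the same as applying $g$ to the average.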
Because expectation is an average value, it is frequently used in problems where costs or profits are connected with the outcomes of a random variable $X$. It is also used a lot as a summary statistic for probability distributions; for example, one often hears about the expected life (expectation of lifetime) for a person or the expected return on an investment. The following are examples.
Example: Expected Winnings in a Lottery
A small lottery sells 1000 tickets numbered $000, 001, \dots, 999$; the tickets cost $10 each. When all the tickets have been sold the draw takes place: this consists of a single number from 000 to 999 being chosen at random. For ticket holders the prize structure is as follows:
Your ticket is drawn - win $5000.
Your ticket has the same first two numbers as the winning ticket, but the third is different - win $100.
Your ticket has the same first number as the winning ticket, but the second number is different - win $10.
All other cases - win nothing.
Let the random variable $X$ represent the winnings from a given ticket. Find $E(X)$.

Solution: The possible values for $X$ are 0, 10, 100, 5000 (dollars). First, we need to find the probability function for $X$. We find (make sure you can do this) that $X$ has probability function

| $x$ | 0 | 10 | 100 | 5000 |
|---|---|---|---|---|
| $f(x)$ | .900 | .090 | .009 | .001 |

The expected winnings are thus the expectation of $X$, or
$$E(X) = 0(.900) + 10(.090) + 100(.009) + 5000(.001) = 6.80.$$
Thus, the gross expected winnings per ticket are $6.80. However, since a ticket costs $10 your expected net winnings are negative, -$3.20 (i.e. an expected loss of $3.20).
Remark: For any lottery or game of chance the expected net winnings per play is a key value. A fair game is one for which this value is 0. Needless to say, casino games and lotteries are never fair: the net winnings for a player are negative.
Remark: The random variable associated with a given problem may be defined in different ways but the expected winnings will remain the same. For example, instead of defining $X$ as the amount won we could have defined $X$ as follows:
$$X = \begin{cases} 3 & \text{all 3 digits of number match winning ticket} \\ 2 & \text{1st 2 digits (only) match} \\ 1 & \text{1st digit (but not the 2nd) matches} \\ 0 & \text{1st digit does not match.} \end{cases}$$
Now, we would define the function $g(x)$ as the winnings when the outcome $x$ occurs. Thus,
$$g(0) = 0, \quad g(1) = 10, \quad g(2) = 100, \quad g(3) = 5000,$$
and the expected winnings are then
$$E[g(X)] = 5000(.001) + 100(.009) + 10(.090) + 0(.900) = \$6.80,$$
the same as before.
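The lottery calculation can also be verified by brute force: fix any ticket and average the winnings over all 1000 equally likely draws. A Python sketch (ticket 123 is an arbitrary choice; by symmetry every ticket gives the same answer):

```python
def winnings(ticket, winning):
    """Prize for `ticket` when `winning` is drawn, per the prize structure above."""
    t, w = f"{ticket:03d}", f"{winning:03d}"
    if t == w:
        return 5000          # all three digits match
    if t[:2] == w[:2]:
        return 100           # first two digits match, third differs
    if t[0] == w[0]:
        return 10            # first digit matches, second differs
    return 0

# Average winnings over the 1000 equally likely draws
expected = sum(winnings(123, w) for w in range(1000)) / 1000   # gross winnings
net = expected - 10                                            # after the $10 ticket price
```

The brute-force average reproduces the $6.80 gross (and -$3.20 net) figure from the solution.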
Example: Diagnostic Medical Tests

Often there are cheaper, less accurate tests for diagnosing the presence of some condition in a person, along with more expensive, accurate tests. Suppose we have two cheap tests and one expensive test, with the following characteristics. All three tests are positive if a person has the condition (there are no ``false negatives''), but the cheap tests give some ``false positives''.

Let a person be chosen at random, and let $D$ = {person has the condition}. The three tests have per-person costs $c_1 < c_2 < c_3$ and false positive rates as follows:

Test 1: cost $c_1$; $P(\text{positive} \mid \overline{D}) = p_1$
Test 2: cost $c_2$; $P(\text{positive} \mid \overline{D}) = p_2$, with $p_2 < p_1$
Test 3: cost $c_3$; $P(\text{positive} \mid \overline{D}) = 0$
We want to check a large number of people for the condition, and have to choose among three testing strategies:
Use Test 1, followed by Test 3 if Test 1 is positive.
Use Test 2, followed by Test 3 if Test 2 is positive.
Use Test 3.
Determine the expected cost per person under each of strategies (i), (ii) and (iii). We will then choose the strategy with the lowest expected cost. It is known that about .001 of the population have the condition (i.e. $P(D) = .001$).
Solution: Define the random variable $X$ as follows (for a random person who is tested):
$$X = \begin{cases} 0 & \text{if the initial test is negative} \\ 1 & \text{if the initial test is positive.} \end{cases}$$
Also let $C$ be the total cost of testing the person. The expected cost per person is then
$$E(C) = \sum_x C(x) f(x).$$
The probability function $f(x)$ for $X$ and the cost function $C(x)$ differ for strategies (i), (ii) and (iii). Consider for example strategy (i), writing $c_1$ and $c_3$ for the per-person costs of Tests 1 and 3 and $p_1$ for Test 1's false positive rate. Then, since Test 1 gives no false negatives,
$$f(1) = P(\text{Test 1 positive}) = P(D) + p_1 P(\overline{D}) = .001 + .999\,p_1,$$
with $C(0) = c_1$ and $C(1) = c_1 + c_3$, so that
$$E(C) = c_1 + c_3(.001 + .999\,p_1).$$
The rest of the probabilities and the associated values of $C(x)$ and $f(x)$ are obtained in the same way for strategies (ii) and (iii). Comparing the three expected costs using the given costs and false positive rates shows that strategy (i) is cheapest.
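The strategy comparison is easy to automate. The sketch below uses hypothetical costs and false positive rates (the numbers $5, $8, $40 and .05, .03 are invented for illustration, not taken from the original table); only $P(D) = .001$ comes from the example.

```python
P_D = 0.001   # P(person has the condition), from the example

def expected_cost(cheap_cost, false_pos_rate, expensive_cost):
    """Expected cost per person for: cheap test first, expensive test if positive.

    P(initial test positive) = P(D) + P(positive | no condition) * P(no condition),
    since the cheap tests never give false negatives.
    """
    p_positive = P_D + false_pos_rate * (1 - P_D)
    return cheap_cost + expensive_cost * p_positive

# Hypothetical numbers, for illustration only
cost_i = expected_cost(5.0, 0.05, 40.0)    # strategy (i):  Test 1, then Test 3
cost_ii = expected_cost(8.0, 0.03, 40.0)   # strategy (ii): Test 2, then Test 3
cost_iii = 40.0                            # strategy (iii): Test 3 for everyone
```

With these illustrative numbers strategy (i) comes out cheapest, but the ranking depends on the actual costs and false positive rates.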
A lottery has tickets numbered 000 to 999 which are sold for $1 each. One ticket is selected at random and a prize of $200 is given to any person whose ticket number is a permutation of the selected ticket number. All 1000 tickets are sold. What is the expected profit or loss to the organization running the lottery?
It is useful to know the means, $E(X)$, of the probability models derived in Chapter 6.
Example: (Mean of binomial distribution)

Let $X \sim \text{Binomial}(n, p)$. Find $E(X)$.

Solution:
$$E(X) = \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x}.$$
When $x = 0$ the value of the expression is 0. We can therefore begin our sum at $x = 1$. Provided $x \neq 0$, we can expand $x\binom{n}{x}$ as
$$x\binom{n}{x} = \frac{x \cdot n!}{x!(n-x)!} = \frac{n!}{(x-1)!(n-x)!} = n\binom{n-1}{x-1}$$
(so it is important to eliminate the term when $x = 0$). This gives
$$E(X) = np \sum_{x=1}^{n} \binom{n-1}{x-1} p^{x-1} (1-p)^{n-x}.$$
Let $y = x - 1$ in the sum, to get
$$E(X) = np \sum_{y=0}^{n-1} \binom{n-1}{y} p^y (1-p)^{(n-1)-y} = np\,[p + (1-p)]^{n-1} = np.$$
Exercise: Does this result make sense? If you try something 100 times and there is a 20% chance of success each time, how many successes do you expect to get, on average?
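The identity $E(X) = np$ can be checked by brute-force summation of the definition (a Python sketch):

```python
from math import comb

def binomial_mean(n, p):
    """E(X) = sum of x * C(n, x) * p^x * (1-p)^(n-x), summed directly."""
    return sum(x * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))

# e.g. 100 trials with a 20% chance of success each time
mean = binomial_mean(100, 0.2)   # matches np = 20
```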
Example: (Mean of Poisson distribution)

Let $X$ have a Poisson distribution where $\lambda$ is the average rate of occurrence and the time interval is of length $t$. Find $E(X)$.

Solution: The probability function of $X$ is $f(x) = \frac{\mu^x e^{-\mu}}{x!}$, $x = 0, 1, 2, \dots$, where $\mu = \lambda t$. Then
$$E(X) = \sum_{x=0}^{\infty} x\,\frac{\mu^x e^{-\mu}}{x!}.$$
As in the binomial example, we can eliminate the term when $x = 0$ and expand $x!$ as $x(x-1)!$ for $x \neq 0$. This gives
$$E(X) = \sum_{x=1}^{\infty} \frac{\mu^x e^{-\mu}}{(x-1)!} = \mu e^{-\mu} \sum_{x=1}^{\infty} \frac{\mu^{x-1}}{(x-1)!}.$$
Let $y = x - 1$ in the sum. Then
$$E(X) = \mu e^{-\mu} \sum_{y=0}^{\infty} \frac{\mu^y}{y!} = \mu e^{-\mu} e^{\mu} = \mu = \lambda t.$$
Note that we used the symbol $\mu = \lambda t$ earlier in connection with the Poisson model; this was because we knew (but couldn't show until now) that $E(X) = \mu$.
Exercise: These techniques can also be used to work out the mean for the hypergeometric or negative binomial distributions. Looking back at how we proved that $\sum_x f(x) = 1$ shows the same method of summation used to find $E(X)$. However, in Chapter 8 we will give a simpler method of finding the means of these distributions, which are $\frac{nr}{N}$ (hypergeometric) and $\frac{k(1-p)}{p}$ (negative binomial).
Variability:

While an average is a useful summary of a set of observations, or of a probability distribution, it omits another important piece of information, namely the amount of variability. For example, it would be possible for car doors to be the right width, on average, and still have no doors fit properly. In the case of fitting car doors, we would also want the door widths to all be close to this correct average. We give a way of measuring the amount of variability next. You might think we could use the average difference between $X$ and $\mu$ to indicate the amount of variation. In terms of expectation, this would be $E(X - \mu)$. However, $E(X - \mu) = E(X) - \mu$ (since $\mu$ is a constant) $= 0$. We soon realize that to measure variability we need a function that is the same sign for $X > \mu$ as for $X < \mu$.
We now define:

The variance of a r.v. $X$ is $E[(X - \mu)^2]$, and is denoted by $\sigma^2$ or by $\text{Var}(X)$.

In words, the variance is the average square of the distance from the mean. This turns out to be a very useful measure of the variability of $X$.
Example: Let $X$ be the number of heads when a fair coin is tossed 4 times. Then $X \sim \text{Binomial}(4, \frac{1}{2})$, so $\mu = np = 2$. Without doing any calculations we know $\sigma^2 \leq 4$, because $X$ is always between 0 and 4. Hence it can never be further away from $\mu = 2$ than 2. This makes the average square of the distance from $\mu$ at most 4. The values of $f(x)$ are

| $x$ | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| $f(x)$ | 1/16 | 4/16 | 6/16 | 4/16 | 1/16 |

since $f(x) = \binom{4}{x}\left(\frac{1}{2}\right)^4$. The value of $\sigma^2$ (i.e. $E[(X - \mu)^2]$) is easily found here:
$$\sigma^2 = \sum_{x=0}^{4} (x - 2)^2 f(x) = 4\left(\tfrac{1}{16}\right) + 1\left(\tfrac{4}{16}\right) + 0\left(\tfrac{6}{16}\right) + 1\left(\tfrac{4}{16}\right) + 4\left(\tfrac{1}{16}\right) = 1.$$
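This small example is easy to verify exactly with fractions (a Python sketch):

```python
from fractions import Fraction
from math import comb

# f(x) = C(4, x) * (1/2)^4 for x = 0, 1, 2, 3, 4
f = {x: Fraction(comb(4, x), 16) for x in range(5)}

mu = sum(x * p for x, p in f.items())                    # the mean, 2
sigma2 = sum((x - mu) ** 2 * p for x, p in f.items())    # the variance, 1
```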
If we keep track of units of measurement the variance will be in peculiar units; e.g. if $X$ is the number of heads in 4 tosses of a coin, $\sigma^2$ is in units of heads$^2$! We can regain the original units by taking the (positive) square root $\sqrt{\sigma^2}$. This is called the standard deviation of $X$, and is denoted by $\sigma$, or as $\text{sd}(X)$. The standard deviation of a random variable $X$ is
$$\sigma = \sqrt{E[(X - \mu)^2]}.$$
Both variance and standard deviation are commonly used to measure variability. The basic definition of variance is often awkward to use for mathematical calculation of $\sigma^2$, whereas the following two results are often useful:

(1) $\sigma^2 = E(X^2) - \mu^2$

(2) $\sigma^2 = E[X(X-1)] + \mu - \mu^2$

Proof: Using properties of expectation,
$$E[(X - \mu)^2] = E(X^2 - 2\mu X + \mu^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2,$$
which gives (1). Also, $E(X^2) = E[X(X-1) + X] = E[X(X-1)] + E(X) = E[X(X-1)] + \mu$; substituting this into (1) gives (2).

Formula (2) is most often used when there is an $x!$ term in the denominator of $f(x)$. Otherwise, formula (1) is generally easier to use.
Example: (Variance of binomial distribution)

Let $X \sim \text{Binomial}(n, p)$. Find $\text{Var}(X)$.

Solution: $f(x) = \binom{n}{x} p^x (1-p)^{n-x}$ has an $x!$ term in its denominator, so we'll use formula (2):
$$E[X(X-1)] = \sum_{x=0}^{n} x(x-1)\binom{n}{x} p^x (1-p)^{n-x}.$$
If $x = 0$ or $x = 1$ the value of the term is 0, so we can begin summing at $x = 2$. For $x \neq 0$ or 1, we can expand the $x(x-1)\binom{n}{x}$ as
$$x(x-1)\binom{n}{x} = \frac{n!}{(x-2)!(n-x)!} = n(n-1)\binom{n-2}{x-2}.$$
Now re-group to fit the binomial theorem, since that was the summation technique used to show $\sum_x f(x) = 1$ and to derive $E(X)$:
$$E[X(X-1)] = n(n-1)p^2 \sum_{x=2}^{n} \binom{n-2}{x-2} p^{x-2} (1-p)^{n-x}.$$
Let $y = x - 2$ in the sum, giving
$$E[X(X-1)] = n(n-1)p^2 \sum_{y=0}^{n-2} \binom{n-2}{y} p^y (1-p)^{(n-2)-y} = n(n-1)p^2.$$
Then
$$\sigma^2 = E[X(X-1)] + \mu - \mu^2 = n(n-1)p^2 + np - n^2p^2 = np(1-p).$$
Remember that the variance of a binomial distribution is $np(1-p)$, since we'll be using it later in the course.
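Formula (2) and the closed form $np(1-p)$ can both be checked by direct summation (a Python sketch; the parameter values are arbitrary):

```python
from math import comb

def binom_pf(n, p, x):
    """Binomial probability function f(x) = C(n, x) p^x (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_variance(n, p):
    """Var(X) via formula (2): E[X(X-1)] + mu - mu^2, summed directly."""
    mu = sum(x * binom_pf(n, p, x) for x in range(n + 1))
    e_xx1 = sum(x * (x - 1) * binom_pf(n, p, x) for x in range(n + 1))
    return e_xx1 + mu - mu**2

var = binom_variance(30, 0.4)   # equals 30 * 0.4 * 0.6 = 7.2
```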
Example: (Variance of Poisson distribution) Find the variance of the Poisson distribution.

Solution: As with the mean, use formula (2):
$$E[X(X-1)] = \sum_{x=2}^{\infty} x(x-1)\frac{\mu^x e^{-\mu}}{x!} = \mu^2 e^{-\mu} \sum_{x=2}^{\infty} \frac{\mu^{x-2}}{(x-2)!}.$$
Let $y = x - 2$ in the sum, giving
$$E[X(X-1)] = \mu^2 e^{-\mu} \sum_{y=0}^{\infty} \frac{\mu^y}{y!} = \mu^2 e^{-\mu} e^{\mu} = \mu^2.$$
Then
$$\sigma^2 = E[X(X-1)] + \mu - \mu^2 = \mu^2 + \mu - \mu^2 = \mu$$
(that is, for the Poisson, the variance equals the mean).
Properties of Mean and Variance

If $a$ and $b$ are constants and $Y = aX + b$, then
$$\mu_Y = a\mu_X + b \qquad \text{and} \qquad \sigma_Y^2 = a^2 \sigma_X^2$$
(where $\mu_X$ and $\sigma_X^2$ are the mean and variance of $X$ and $\mu_Y$ and $\sigma_Y^2$ are the mean and variance of $Y$).

Proof: We already showed that $E(aX + b) = aE(X) + b$, i.e. $\mu_Y = a\mu_X + b$, and then
$$\sigma_Y^2 = E[(Y - \mu_Y)^2] = E[(aX + b - a\mu_X - b)^2] = E[a^2 (X - \mu_X)^2] = a^2 E[(X - \mu_X)^2] = a^2 \sigma_X^2.$$
This result is to be expected. Adding a constant, $b$, to all values of $X$ has no effect on the amount of variability. So it makes sense that $\text{Var}(aX + b)$ doesn't depend on the value of $b$.
A simple way to relate to this result is to consider a random variable $X$ which represents a temperature in degrees Celsius (even though this is a continuous random variable, which we don't study until Chapter 9). Now let $Y$ be the corresponding temperature in degrees Fahrenheit. We know that $Y = 1.8X + 32$, and it is clear if we think about it that $\mu_Y = 1.8\mu_X + 32$ and that $\sigma_Y = 1.8\sigma_X$.
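The Celsius-to-Fahrenheit check can be done numerically with any distribution for $X$; the three-point distribution below is invented purely for illustration.

```python
# Hypothetical distribution for a Celsius temperature X (illustrative values only)
f = {15: 0.2, 20: 0.5, 25: 0.3}

def mean_and_variance(pf):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mu = sum(x * p for x, p in pf.items())
    var = sum((x - mu) ** 2 * p for x, p in pf.items())
    return mu, var

mu_x, var_x = mean_and_variance(f)
# Y = 1.8 X + 32 is the same temperature in Fahrenheit
f_y = {1.8 * x + 32: p for x, p in f.items()}
mu_y, var_y = mean_and_variance(f_y)
# mu_y equals 1.8 * mu_x + 32, and var_y equals 1.8**2 * var_x
```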
An airline knows that there is a 97% chance a passenger for a certain flight will show up, and assumes passengers arrive independently of each other. Tickets cost $100, but if a passenger shows up and can't be carried on the flight the airline has to refund the $100 and pay a penalty of $400 to each such passenger. How many tickets should they sell for a plane with 120 seats to maximize their expected ticket revenues after paying any penalty charges? Assume ticket holders who don't show up get a full refund for their unused ticket.
A typist typing at a constant speed of 60 words per minute makes a mistake in any particular word with probability .04, independently from word to word. Each incorrect word must be corrected, a task which takes 15 seconds per word.
Find the mean and variance of the time (in seconds) taken to finish a 450 word passage.
Would it be less time consuming, on average, to type at 45 words per minute if this reduces the probability of an error to .02?
We have now seen two functions which characterize a distribution, the probability function and the cumulative distribution function. There is a third type of function, the moment generating function, which uniquely determines a distribution. The moment generating function is closely related to other transforms used in mathematics, the Laplace and Fourier transforms.
Consider a discrete random variable $X$ with probability function $f(x)$. The moment generating function (m.g.f.) of $X$ is defined as
$$M(t) = E(e^{tX}) = \sum_x e^{tx} f(x),$$
assuming that this sum exists. The moments of a random variable $X$ are the expectations of the functions $X^r$ for $r = 1, 2, \dots$; the expectation $E(X^r)$ is called the $r$th moment of $X$.
The mean $\mu = E(X)$ is therefore the first moment, $E(X^2)$ the second, and so on. It is often easy to find the moments of a probability distribution mathematically by using the moment generating function. This often gives easier derivations of means and variances than the direct summation methods in the preceding section. The following theorem gives a useful property of m.g.f.'s.
Let the random variable $X$ have m.g.f. $M(t)$. Then
$$E(X^r) = M^{(r)}(0), \qquad r = 1, 2, \dots,$$
where $M^{(r)}(0)$ stands for $\frac{d^r M(t)}{dt^r}$ evaluated at $t = 0$.

Proof:
$$M^{(r)}(t) = \frac{d^r}{dt^r} \sum_x e^{tx} f(x) = \sum_x x^r e^{tx} f(x),$$
and if the sum converges, then
$$M^{(r)}(0) = \sum_x x^r f(x) = E(X^r),$$
as stated.
This sometimes gives a simple way to find the moments for a distribution.

Example 1. Suppose $X$ has probability function $f(x)$ with m.g.f. $M(t) = \sum_x e^{tx} f(x)$. Then $\mu = E(X) = M'(0)$ and $E(X^2) = M''(0)$. From this we can also find
$$\sigma^2 = E(X^2) - \mu^2 = M''(0) - \left[M'(0)\right]^2.$$
Exercise. Poisson distribution. Show that the Poisson distribution with probability function
$$f(x) = \frac{\mu^x e^{-\mu}}{x!}, \qquad x = 0, 1, 2, \dots$$
has m.g.f. $M(t) = e^{\mu(e^t - 1)}$. Then show that $E(X) = \mu$ and $\text{Var}(X) = \mu$.
The m.g.f. also uniquely identifies a distribution, in the sense that two different distributions cannot have the same m.g.f.: if two random variables have the same moment generating function, they have the same distribution (so the same probability function, cumulative distribution function, moments, etc.). This result is often used to find the distribution of a random variable. But in fact moment generating functions can also be used to determine that a sequence of distributions gets closer and closer to some limiting distribution. To show this, suppose that a sequence of probability functions $f_n(x)$ have corresponding moment generating functions
$$M_n(t) = \sum_x e^{tx} f_n(x).$$
Suppose moreover that the probability functions $f_n(x)$ converge to another probability function $f(x)$ pointwise in $x$ as $n \to \infty$; this is what we mean by convergence of discrete distributions. Then, since
$$\lim_{n \to \infty} M_n(t) = \lim_{n \to \infty} \sum_x e^{tx} f_n(x) = \sum_x e^{tx} f(x) = M(t),$$
the m.g.f.'s $M_n(t)$ converge to $M(t)$, the moment generating function of the limiting distribution. It shouldn't be too surprising that a very useful converse to this result also holds:
Suppose $X_n$ has moment generating function $M_n(t)$ and $\lim_{n \to \infty} M_n(t) = M(t)$ for each $t$ in an interval around $t = 0$, where $M(t)$ is the m.g.f. of a random variable $X$. Then the distribution of $X_n$ converges to that of $X$.
For example, we saw in Chapter 6 that a Binomial$(n, p)$ distribution with very large $n$ and very small $p$ is close to a Poisson distribution with parameter $\mu = np$. Consider the moment generating function of such a binomial random variable:
$$M_n(t) = (1 - p + pe^t)^n = \left(1 + \frac{\mu(e^t - 1)}{n}\right)^n, \qquad \text{where } p = \frac{\mu}{n}.$$
Now consider the limit of this expression as $n \to \infty$. Since in general
$$\lim_{n \to \infty}\left(1 + \frac{c}{n}\right)^n = e^c,$$
the limit of $M_n(t)$ as $n \to \infty$ is $e^{\mu(e^t - 1)}$, and this is the moment generating function of a Poisson distribution with parameter $\mu$. This shows a little more formally than we did earlier that the Binomial$(n, \mu/n)$ distribution approaches the Poisson($\mu$) distribution as $n \to \infty$.
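The convergence can be seen numerically by comparing the Binomial$(n, \mu/n)$ probabilities with the Poisson($\mu$) probabilities for a large $n$ (a Python sketch; $\mu = 2$ and $n = 10000$ are arbitrary choices):

```python
from math import comb, exp, factorial

def binom_pf(n, p, x):
    """Binomial probability function."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pf(mu, x):
    """Poisson probability function."""
    return exp(-mu) * mu**x / factorial(x)

mu, n = 2.0, 10_000
# Largest pointwise gap between Binomial(n, mu/n) and Poisson(mu) over x = 0..19
gap = max(abs(binom_pf(n, mu / n, x) - poisson_pf(mu, x)) for x in range(20))
```

The gap shrinks toward 0 as $n$ grows, consistent with the m.g.f. limit above.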
Let $X$ have a given probability function $f(x)$. Find the mean and variance of $X$.
A game is played where a fair coin is tossed until the first tail occurs. The probability that $n$ tosses will be needed is $f(n) = 2^{-n}$, $n = 1, 2, 3, \dots$. You win $\$2^n$ if $n$ tosses are needed for $n = 1, 2, \dots, 8$, but lose $256 if $n > 8$. Determine your expected winnings.
Diagnostic tests. Consider diagnostic tests like those discussed above in the example of Section 7.3 and in Problem 15 for Chapter 4. Assume that for a randomly selected person, the probabilities $P(D)$, $P(\text{positive} \mid D)$ and $P(\text{positive} \mid \overline{D})$ are as given there, so that the inexpensive test only gives false positive, and not false negative, results.
Suppose that this inexpensive test costs $10. If a
person tests positive then they are also given a more expensive test, costing
$100, which correctly identifies all persons with the disease. What is the
expected cost per person if a population is tested for the disease using the
inexpensive test followed, if necessary, by the expensive test?
Diagnostic tests II. Two percent of the population has a certain condition for which there are two diagnostic tests. Test A, which costs $1 per person, gives positive results for 80% of persons with the condition and for 5% of persons without the condition. Test B, which costs $100 per person, gives positive results for all persons with the condition and negative results for all persons without it.
Suppose that test B is given to 150 persons, at a cost of $15,000. How many cases of the condition would one expect to detect?
Suppose that 2000 persons are given test A, and then only those who test positive are given test B. Show that the expected cost is $15,000 but that the expected number of cases detected is much larger than in part (a).
The probability that a roulette wheel stops on a red number is 18/37. For each bet on ``red'' you win the amount bet if the wheel stops on a red number, and lose your money if it does not.
If you bet $1 on each of 10 consecutive plays, what is your expected winnings? What is your expected winnings if you bet $10 on a single play?
For each of the two cases in part (a), calculate the probability that you made a profit (that is, your ``winnings'' are positive, not negative).
Slot machines. Consider the slot machine discussed above in Problem 16 for Chapter 4. Suppose that the number of each type of symbol on wheels 1, 2 and 3 is as given below:

| Symbol | Wheel 1 | Wheel 2 | Wheel 3 |
|---|---|---|---|
| Flower | 2 | 6 | 2 |
| Dog | 4 | 3 | 3 |
| House | 4 | 1 | 5 |
If all three wheels stop on a flower, you win $20 for a $1 bet. If all three wheels stop on a dog, you win $10, and if all three stop on a house, you win $5. Otherwise you win nothing.
Find your expected winnings per dollar spent.
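A sketch of the expected-winnings calculation in Python, using the symbol counts from the table (each wheel carries 10 symbols in total, since the counts in each column sum to 10):

```python
# Number of each symbol on wheels 1, 2, 3 (from the table; 10 symbols per wheel)
wheels = [
    {"flower": 2, "dog": 4, "house": 4},
    {"flower": 6, "dog": 3, "house": 1},
    {"flower": 2, "dog": 3, "house": 5},
]
payout = {"flower": 20, "dog": 10, "house": 5}

def expected_winnings():
    """Sum over symbols of payout * P(all three wheels stop on that symbol)."""
    total = 0.0
    for symbol, prize in payout.items():
        p = 1.0
        for wheel in wheels:
            p *= wheel[symbol] / 10
        total += prize * p
    return total
```

Per dollar bet, the expected return works out to less than $1, as one would expect for a slot machine.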
Suppose that $n$ people take a blood test for a disease, where each person has probability $p$ of having the disease, independent of other persons. To save time and money, blood samples from $k$ people are pooled and analyzed together. If none of the $k$ persons has the disease then the test will be negative, but otherwise it will be positive. If the pooled test is positive then each of the $k$ persons is tested separately (so $k + 1$ tests are done in that case).

(a) Let $X$ be the number of tests required for a group of $k$ people. Show that
$$E(X) = k + 1 - k(1 - p)^k.$$

(b) What is the expected number of tests required for $n/k$ groups of $k$ people each? Evaluate this for several values of $k$ when $p$ is small.

(c) Show that if $p$ is small, the expected number of tests in part (b) is approximately $n(kp + \frac{1}{k})$, and is minimized for $k \doteq p^{-1/2}$.
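The formula in part (a) is easy to sanity-check by simulation (a Python sketch; the values $k = 10$, $p = .02$ are arbitrary illustrative choices):

```python
import random

def expected_tests(k, p):
    """E(X) = 1 * (1-p)^k + (k+1) * (1 - (1-p)^k) = k + 1 - k * (1-p)^k."""
    return k + 1 - k * (1 - p) ** k

def simulated_tests(k, p, groups=200_000, seed=7):
    """Average number of tests over many simulated groups of k people."""
    rng = random.Random(seed)
    total = 0
    for _ in range(groups):
        infected = any(rng.random() < p for _ in range(k))
        total += k + 1 if infected else 1
    return total / groups

exact = expected_tests(10, 0.02)
approx = simulated_tests(10, 0.02)   # close to `exact`
```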
A manufacturer of car radios ships them to retailers in cartons of $n$ radios. The profit per radio is $59.50, less shipping cost of $25 per carton, so the profit is $\$(59.50n - 25)$ per carton. To promote sales by assuring high quality, the manufacturer promises to pay the retailer a penalty amount $c(x)$ if $x$ radios in the carton are defective. (The retailer is then responsible for repairing any defective radios.) Suppose radios are produced independently and that 5% of radios are defective. How many radios should be packed per carton to maximize expected net profit per carton?
Let $X$ have a geometric distribution with probability function
$$f(x) = p(1 - p)^x, \qquad x = 0, 1, 2, \dots$$

(a) Calculate the m.g.f. $M(t) = E(e^{tX})$ (for values of $t$ where the sum converges).

(b) Find the mean and variance of $X$.

(c) Use your result in (b) to show that if $p$ is the probability of ``success'' ($S$) in a sequence of Bernoulli trials, then the expected number of trials until the first $S$ occurs is $1/p$. Explain why this is ``obvious''.
Analysis of Algorithms: Quicksort. Suppose we have a set of $n$ distinct numbers and we wish to sort them from smallest to largest. The quicksort algorithm works as follows: when $n = 2$ it just compares the numbers and puts the smallest one first. For $n > 2$ it starts by choosing a random ``pivot'' number from the $n$ numbers. It then compares each of the other $n - 1$ numbers with the pivot and divides them into group $A$ (numbers smaller than the pivot) and group $B$ (numbers bigger than the pivot). It then does the same thing with $A$ and $B$ as it did with the original set, and repeats this recursively until the numbers are all sorted. (Try this out with a small set of numbers to see how it works.) In computer science it is common to analyze such algorithms by finding the expected number of comparisons (or other operations) needed to sort a list. Thus, let $C_n$ be the expected number of comparisons required to sort $n$ numbers, with $C_0 = C_1 = 0$.

(a) Show that if $X$ is the number of comparisons needed and the pivot is the $i$th smallest of the $n$ numbers, then the expected number of comparisons is $(n - 1) + C_{i-1} + C_{n-i}$.

(b) Show that
$$C_n = (n - 1) + \frac{1}{n}\sum_{i=1}^{n}\left(C_{i-1} + C_{n-i}\right)$$
and thus that $C_n$ satisfies the recursion (note $C_0 = C_1 = 0$)
$$C_n = (n - 1) + \frac{2}{n}\sum_{k=0}^{n-1} C_k.$$

(c) Show that
$$\frac{C_n}{n+1} = \frac{C_{n-1}}{n} + \frac{2(n-1)}{n(n+1)}.$$

(d) (Harder) Use the result of part (c) to show that for large $n$, $C_n \sim 2n\ln n$. (Note: $a_n \sim b_n$ means $a_n / b_n \to 1$ as $n \to \infty$.) This proves a result from computer science which says that for quicksort, the expected number of comparisons is $O(n \log n)$.
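Part (b)'s recursion can be iterated numerically; the closed form $C_n = 2(n+1)H_n - 4n$, where $H_n$ is the $n$th harmonic number, is a known solution of it and gives $C_n \sim 2n\ln n$. A Python sketch:

```python
def expected_comparisons(N):
    """Iterate the recursion C_n = (n - 1) + (2/n) * sum(C_0 .. C_{n-1})."""
    C = [0.0] * (N + 1)   # C_0 = C_1 = 0
    s = 0.0               # running sum C_0 + ... + C_{n-1}
    for n in range(2, N + 1):
        C[n] = (n - 1) + 2 * s / n
        s += C[n]
    return C[N]

def closed_form(n):
    """Known solution C_n = 2 (n+1) H_n - 4 n, with H_n the nth harmonic number."""
    h = sum(1 / k for k in range(1, n + 1))
    return 2 * (n + 1) * h - 4 * n
```

For example, $C_2 = 1$ and $C_3 = 8/3$; the ratio $C_n / (2n\ln n)$ tends to 1, though the convergence is slow.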
Find the distributions that correspond to the following moment-generating functions:
(a)
(b)
Find the moment generating function $M(t)$ of the discrete uniform distribution on $\{a, a+1, \dots, b\}$, for which $P(X = x) = \frac{1}{b - a + 1}$ for $x = a, a+1, \dots, b$. What do you get in the special case $a = b$, and in the case $b = a + 1$? Use the moment generating function in these two cases to confirm the expected value and the variance of $X$.
Let $X$ be a random variable taking values in the set $\{0, 1, 2\}$, with first and second moments $E(X) = m_1$ and $E(X^2) = m_2$.

(a) Find the moment generating function of $X$.

(b) Find the first six moments of $X$.

(c) Find the probability function of $X$.

(d) Show that any probability distribution on $\{0, 1, 2\}$ is completely determined by its first two moments.