General Statistical Models

The time has come to learn some theory. This is a preview of STAT 5101–5102. We don’t need to learn much theory. We will proceed with the general strategy of all introductory statistics: don’t derive anything, just tell you stuff.

3.1 Probability Models

3.1.1 Kinds of Probability Theory

There are two kinds of probability theory. There is the kind you will learn in STAT 5101-5102 (or 4101–4102, which is more or less the same except for leaving out the multivariable stuff). The material covered goes back hundreds of years, some of it discovered in the early 1600’s. And there is the kind you would learn in MATH 8651–8652 if you could take it, which no undergraduate does (it is a very hard Ph. D. level math course). The material covered goes back to 1933. It is all “new math”. These two kinds of probability can be called classical and measure-theoretic, respectively.

The old kind (classical) is still very useful. For every 1000 people who know a lot of probability theory, 999 know only the classical theory. This includes a lot of working scientists. And it is a lot easier than the new kind (measure-theoretic), which is why we still teach it at the undergraduate and master's level, and to Ph. D. level scientists in all fields. (Only math and stat Ph. D. students take the measure-theoretic probability course.) And Minnesota is no different from any other university in this respect.

3.1.2 Classical Probability Models

In classical probability theory there are two kinds of probability models (also called probability distributions). They are called discrete and continuous. The fact that there are two kinds means everything has to be done twice, once for discrete, once for continuous.

3.1.2.1 Discrete Probability Models

A discrete probability model is specified by a finite or countable set \(S\) called the sample space and a real-valued function \(f\) on the sample space called the probability mass function (PMF) of the model. A PMF satisfies two properties

\[ f(x) \ge 0, \quad x \in S, \qquad \sum_{x \in S} f(x) = 1. \]
We say that \(f(x)\) is the probability of the outcome \(x\). So the two properties say that probabilities are nonnegative and sum to one.

That should sound familiar, just like what they told you probability was in your intro statistics course. The only difference is that, now that you know calculus, \(S\) can be an infinite set, so the summation here can be an infinite sum.

In principle, the sample space can be any set, but all discrete probability models that are well known, have names, and are used in applications have sample spaces that are subsets of the integers. Here are a few examples.

3.1.2.1.1 The Binomial Distribution

The binomial distribution describes the number of successes in \(n\) stochastically independent and identically distributed (IID) random trials, each of which can have only two outcomes, conventionally called success and failure, although they could be anything; the important point is that there are only two possible outcomes.

If \(p\) is the probability of success in any single trial, then the probability of \(x\) successes in \(n\) trials is

\[ f(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad x = 0, 1, 2, \ldots, n, \]
where
\[ \binom{n}{x} = \frac{n!}{x! \, (n - x)!} \]
is called a binomial coefficient and gives the distribution its name. And this is the PMF of the binomial distribution.

The fact that probabilities sum to one is a special case of the binomial theorem

\[ \sum_{x=0}^n \binom{n}{x} a^x b^{n-x} = (a + b)^n. \]
Taking \(a = p\) and \(b = 1 - p\) makes the left-hand side the sum of the binomial probabilities and the right-hand side \(1^n = 1\). Special cases of the binomial theorem are
\[
\begin{aligned}
(a + b)^2 & = a^2 + 2 a b + b^2 \\
(a + b)^3 & = a^3 + 3 a^2 b + 3 a b^2 + b^3 \\
(a + b)^4 & = a^4 + 4 a^3 b + 6 a^2 b^2 + 4 a b^3 + b^4
\end{aligned}
\]
and so forth, which you may remember from high school algebra.
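
Here is a minimal R sketch of the formula in action (the values of \(n\) and \(p\) are arbitrary choices for illustration, not from the notes): it checks that the PMF above agrees with R's built-in dbinom and that the probabilities sum to one.

```r
## Check the binomial PMF formula numerically (illustrative n and p).
n <- 10
p <- 0.3
x <- 0:n
pmf <- choose(n, x) * p^x * (1 - p)^(n - x)  # PMF from the formula above
all.equal(pmf, dbinom(x, n, p))              # agrees with R's built-in PMF
sum(pmf)                                     # probabilities sum to one
```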

The binomial distribution is very important. It arises any time there are data with only two outcomes: yes or no, for or against, vanilla or chocolate, whatever.

There is a generalization called the multinomial distribution that allows for any (finite) number of outcomes. But we won’t bother with that. (It is the basis of STAT 5421, categorical data analysis.)

3.1.2.1.2 The Poisson Distribution

The Poisson distribution (named after a man named Poisson, it's not about fish) describes the number of things in any part of a stochastic process where the locations of things are stochastically independent (none is affected by any of the others). Examples would be the number of winners of a lottery, the number of raisins in a slice of carrot cake, the number of red blood cells in a drop of blood, the number of visible stars in a region of the sky, or the number of traffic accidents in Minneapolis today. It doesn't matter what is counted; so long as the thingummies counted have nothing to do with each other, you get the Poisson distribution.

Its PMF is

\[ f(x) = \frac{\mu^x}{x!} e^{-\mu}, \qquad x = 0, 1, 2, \ldots, \]
where \(\mu\) can be any positive real number (more on this later).

The fact that probabilities sum to one is a special case of the Maclaurin series (Taylor series around zero) of the exponential function

\[ e^{\mu} = \sum_{x=0}^{\infty} \frac{\mu^x}{x!}. \]
Here the sample space is infinite, so the fact that probabilities sum to one involves an infinite series.

The Poisson distribution was initially derived from the binomial distribution. It is what you get when you let \(p\) go to zero and \(n\) go to infinity in the binomial PMF in such a way that \(np \to \mu\). So the Poisson distribution is an approximation for the binomial distribution when \(n\) is very large, \(p\) is very small, and \(np = \mu\) is moderate sized. This illustrates how one probability distribution can be derived from another.
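
The limit can be seen numerically. Here is a minimal R sketch (the values of \(\mu\), \(n\), and the range of \(x\) are arbitrary illustrations): for large \(n\) and \(p = \mu / n\), the binomial and Poisson PMFs nearly agree.

```r
## Poisson approximation to the binomial: fix mu, let n be large, set p = mu / n.
mu <- 2
n <- 10000
p <- mu / n
x <- 0:10
round(cbind(binomial = dbinom(x, n, p), poisson = dpois(x, mu)), 6)
```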

3.1.2.1.3 The Zero-Truncated Poisson Distribution

We already met the zero-truncated Poisson distribution. This arises when you have a Poisson distribution except for zero counts. There may be reasons why a zero occurs other than Poisson variation: the chef may have forgotten the raisins in the recipe, rather than your slice of carrot cake having no raisins purely by chance variation, just the way things came out in the mixing of the batter and the slicing of the cake.

The zero-truncated Poisson distribution is widely used in aster models, and we used it as an example of a function that requires extreme care if you want to calculate it accurately using computer arithmetic (supplementary notes).

The exact definition is that the zero-truncated Poisson distribution is what you get when you take Poisson data and throw out all the zero counts. So its PMF is the PMF of the Poisson distribution with zero removed from the sample space and all of the probabilities re-adjusted to sum to one.

For the Poisson distribution \(f(0) = e^{-\mu}\), so the probability of a nonzero count is \(1 - e^{-\mu}\), so the zero-truncated Poisson distribution has PMF

\[ f(x) = \frac{\mu^x e^{-\mu}}{x! \, (1 - e^{-\mu})}, \qquad x = 1, 2, 3, \ldots. \]

This is another illustration of how one probability distribution can be derived from another.
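
Here is a minimal R sketch of the zero-truncated PMF just described (the value of \(\mu\) and the truncation of the infinite sum are arbitrary choices). Using expm1 to compute \(1 - e^{-\mu}\) is one way to take the care with computer arithmetic that the supplementary notes warn about.

```r
## Zero-truncated Poisson PMF: Poisson PMF divided by Pr(X >= 1) = 1 - exp(-mu).
dpois.trunc <- function(x, mu) ifelse(x >= 1, dpois(x, mu) / (- expm1(- mu)), 0)
mu <- 1.5
sum(dpois.trunc(1:100, mu))   # should be very nearly one
```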

3.1.2.2 Univariate Continuous Probability Models

A univariate continuous probability model is specified by a real-valued function \(f\) of one real variable called the probability density function (PDF) of the model. A PDF satisfies two properties

\[ f(x) \ge 0, \quad -\infty < x < \infty, \qquad \int_{-\infty}^{\infty} f(x) \, dx = 1. \]

We say that \(f(x) \, dx\) is the probability of an outcome in the interval from \(x\) to \(x + dx\) when \(dx\) is very small. For this to be exactly correct, \(dx\) has to be infinitesimal. To get the probability for a finite interval, one has to integrate

\[ \Pr(a < X < b) = \int_a^b f(x) \, dx. \]

That should sound familiar, just like what they told you probability was in your intro statistics course. Integrals are area under a curve. Probability is area under a curve (for continuous distributions).

3.1.2.2.1 The Normal Distribution

The normal distribution arises whenever one averages a large number of IID random variables (with one proviso, which we will discuss later). This is called the central limit theorem (CLT).

Its PDF is

\[ f(x) = \frac{1}{\sqrt{2 \pi} \, \sigma} \, e^{-(x - \mu)^2 / (2 \sigma^2)}, \qquad -\infty < x < \infty. \]

The fact that this integrates to one is something they didn’t teach you in calculus of one variable (because the trick of doing it involves multivariable calculus, in particular, polar coordinates).

The special case when \(\mu = 0\) and \(\sigma = 1\) is called the standard normal distribution. Its PDF is

\[ f(z) = \frac{1}{\sqrt{2 \pi}} \, e^{-z^2 / 2}, \qquad -\infty < z < \infty. \]
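
Here is a minimal R sketch (the interval endpoints are arbitrary) checking numerically that the standard normal PDF integrates to one and that probabilities of intervals are areas under the curve.

```r
## Numerical checks for the standard normal PDF.
f <- function(z) exp(- z^2 / 2) / sqrt(2 * pi)  # standard normal PDF from the formula
integrate(f, -Inf, Inf)        # total area: should be 1
integrate(f, -1.96, 1.96)      # area over an interval: about 0.95
pnorm(1.96) - pnorm(-1.96)     # the same probability via R's built-in CDF
```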

3.1.2.2.2 The Cauchy Distribution

The Cauchy distribution arises in no applications I know of. It is a mathematical curiosity mostly useful as a counterexample. Many things that are true of other distributions don't hold for Cauchy. For example, the average of IID Cauchy random variables does not obey the CLT (more on this later). If \(X\) and \(Y\) are independent standard normal random variables, then \(X / Y\) has a Cauchy distribution, but this is an operation that does not seem to arise in applications.

Its PDF is

\[ f(x) = \frac{1}{\pi \sigma \left[ 1 + \left( \frac{x - \mu}{\sigma} \right)^2 \right]}, \qquad -\infty < x < \infty. \]
The special case when \(\mu = 0\) and \(\sigma = 1\) is called the standard Cauchy distribution. Its PDF is
\[ f(x) = \frac{1}{\pi (1 + x^2)}, \qquad -\infty < x < \infty. \]

The fact that these integrate to one involves, firstly, change-of-variable, the substitution \(u = (x - \mu)/\sigma\) establishing that

\[ \int_{-\infty}^{\infty} \frac{1}{\pi \sigma \left[ 1 + \left( \frac{x - \mu}{\sigma} \right)^2 \right]} \, dx = \int_{-\infty}^{\infty} \frac{1}{\pi (1 + u^2)} \, du, \]
and, secondly,
\[ \int \frac{du}{1 + u^2} = \operatorname{atan}(u) + \text{a constant}, \]
where atan denotes the arctangent function, and, lastly,
\[ \lim_{u \to \infty} \operatorname{atan}(u) = \frac{\pi}{2}, \qquad \lim_{u \to -\infty} \operatorname{atan}(u) = -\frac{\pi}{2}, \]
so the value of the integral is \(\frac{1}{\pi} \left[ \frac{\pi}{2} - \left( -\frac{\pi}{2} \right) \right] = 1\).
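
Here is a minimal R sketch (sample size and seed are arbitrary) checking that the standard Cauchy PDF integrates to one, in agreement with the arctangent calculation, and that a ratio of independent standard normals behaves like a Cauchy random variable.

```r
## Numerical checks for the standard Cauchy distribution.
f <- function(x) 1 / (pi * (1 + x^2))   # standard Cauchy PDF from the formula
integrate(f, -Inf, Inf)                 # total area: should be 1
(atan(Inf) - atan(-Inf)) / pi           # the arctangent argument also gives 1
set.seed(42)
z <- rnorm(10000) / rnorm(10000)        # ratio of independent standard normals
quantile(z, c(0.25, 0.75))              # compare to the Cauchy quartiles,
qcauchy(c(0.25, 0.75))                  # which are -1 and 1
```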

3.1.2.3 Multivariate Continuous Probability Models

A distribution for two or more continuous random variables is the same except this is multivariable calculus. For example, a probability distribution for three variables \(X\), \(Y\), and \(Z\) has a PDF that is nonnegative and integrates to one, but now this involves a triple integral

\[ f(x, y, z) \ge 0, \quad -\infty < x < \infty, \ -\infty < y < \infty, \ -\infty < z < \infty, \qquad \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y, z) \, dx \, dy \, dz = 1. \]
As in the case of univariate models, the PDF does not give probability but rather (as the name says) probability density: \(f(x, y, z) \, dx \, dy \, dz\) is the probability of the box \((x, x + dx) \times (y, y + dy) \times (z, z + dz)\) if \(dx\), \(dy\), and \(dz\) are infinitesimal, but for finite regions \(A\), you have to do a triple integral
\[ \Pr\{(X, Y, Z) \in A\} = \iiint_A f(x, y, z) \, dx \, dy \, dz \]
and that is something only learned in multivariable calculus.

3.1.2.4 Stochastic Independence

We say random variables are stochastically independent or statistically independent or independent (with no qualifying adjective) if

  • the values of any of them have nothing to do with the values of the others (this is the concept we use for applications), or

  • the PMF or PDF factors into a product of functions of one variable

    \[ f(x_1, \ldots, x_n) = \prod_{i=1}^n f_i(x_i) \]
    (this is the theoretical concept). This equation is so important that it has its own terminology: the phrase the joint distribution is the product of the marginal distributions, or even shorter, the joint is the product of the marginals, means the joint distribution of all the variables (the left-hand side of this equation) is equal to the product (on the right-hand side of this equation) of the marginal distributions, meaning \(f_i\) is the PDF or PMF, as the case may be, of \(X_i\).

So we have two concepts of independence, one applied (that we use to tell us what applications can use this concept) and one theoretical (that we use to tell us how this concept affects the mathematics).

In statistics, we should never use independent with any other meaning to avoid confusion with any other notion of independence. In particular, in regression models we never say dependent and independent variables, but always say predictor and response variable instead.

The theoretical concept implies the applied concept because the PDF or PMF factoring implies that probability calculations will also factor: in the continuous case

\[ \Pr(a_i < X_i < b_i, \ i = 1, \ldots, n) = \prod_{i=1}^n \int_{a_i}^{b_i} f_i(x_i) \, dx_i \]
and in the discrete case
\[ \Pr(a_i < X_i < b_i, \ i = 1, \ldots, n) = \prod_{i=1}^n \sum_{\substack{x_i \in S_i \\ a_i < x_i < b_i}} f_i(x_i) \]
(and we see that in the discrete case the sample space also has to factor as a Cartesian product \(S_1 \times S_2 \times \cdots \times S_n\), so that the values of each variable that are possible have nothing to do with the values of the others. This was automatic in the continuous case because we took the sample space to be \(\mathbb{R}^n\), the \(n\)-fold Cartesian product of the real line with itself, the set of all \(n\)-tuples of real numbers, just as \(S_1 \times S_2 \times \cdots \times S_n\) denotes the set of all \(n\)-tuples \((x_1, x_2, \ldots, x_n)\) with \(x_i \in S_i\) for each \(i\).) And independence gets us back (almost) to univariate calculus. We have integrals or sums involving only one variable.
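
Here is a minimal R sketch of the factoring of probabilities (independent standard normals and the particular rectangle are arbitrary choices): the simulated joint probability of a rectangle matches the product of the marginal probabilities.

```r
## Joint probabilities factor for independent random variables.
set.seed(42)
x <- rnorm(100000)                               # X and Y independent standard normals
y <- rnorm(100000)
mean(x > 0 & x < 1 & y > -1 & y < 2)             # simulated joint probability of a rectangle
(pnorm(1) - pnorm(0)) * (pnorm(2) - pnorm(-1))   # product of the marginal probabilities
```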

3.1.2.5 Independent and Identically Distributed (IID)

The phrase in the section title, so important that it gets its own TLA (three-letter acronym), is just the special case of independence where all the random variables have the same distribution, so the theoretical concept is

\[ f(x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i) \]
(all the marginals on the right-hand side are the same distribution: here we have \(f\) where the analogous equation above had \(f_i\)).

3.1.2.6 Introductory Statistics versus Theoretical Statistics

In intro stats the only statistical model discussed is finite population sampling: there are \(N\) individuals, which are taken to be fixed, not random, a specified population. For example, the population could be the students registered at the University of Minnesota today at 8:00 a. m. (which students the university has changes over time). Then we take a simple random sample (SRS) of this population, which is a special case of IID. The random variables \(X_i\) are measurements on each individual (quantitative or qualitative) selected for the sample. And SRS means the same as IID: whether one individual is selected for the sample has nothing to do with which other individuals are selected. This means that \(X_i\) and \(X_j\) can be measurements on the same individual: which individual \(X_i\) is a measurement on has nothing to do with which individual \(X_j\) is a measurement on, and this means, in particular, that we cannot require that these individuals be different (that would make \(X_i\) have something to do with \(X_j\)). For those who have heard that terminology, we are talking about so-called sampling with replacement as opposed to sampling without replacement.

To be theoretically astute, you have to move out of finite population sampling and replace SRS with IID. In SRS we are sampling from a finite set (the population), so every variable is discrete whether we think of it that way or not. In IID we can have continuous random variables. But then the SRS story breaks down. Sampling from an infinite population doesn't make sense.

Strangely, statistics teachers and applied statisticians often use the terminology of SRS (the sample and the population) even when they are talking about IID, where those terms don't make any sense, so they are making an imprecise analogy with finite population sampling.

3.1.2.7 Random Variables and Expectation

Applicationally, a random variable is any measurement on a random process. Theoretically, a random variable is a function on the sample space. Either of these definitions makes any function of a random variable or variables another random variable.

If \(X\) is the original variable (taking values in the sample space) and \(Y = g(X)\), then the expectation or expected value or mean or mean value (all these terms mean the same thing) of \(Y\) is

\[ E(Y) = E\{g(X)\} = \int_{-\infty}^{\infty} g(x) f(x) \, dx \]
in the continuous case and
\[ E(Y) = E\{g(X)\} = \sum_{x \in S} g(x) f(x) \]
in the discrete case, and there are analogous formulas for multivariable cases, which we will try to avoid.

So to calculate the expectation (a. k. a. mean) of a random variable, you multiply the values of the random variable (here \(g(x)\)) by the corresponding probabilities or probability density (here \(f(x)\)) and sum or integrate, as the case may be.
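
Here is a minimal R sketch of "value times probability, summed" (the Poisson distribution, the function \(g\), and the truncation of the infinite sum are arbitrary choices).

```r
## Expectation of g(X) for a discrete random variable: sum of g(x) * f(x).
mu <- 3
x <- 0:200                    # truncate the infinite sum; the tail is negligible here
g <- function(x) x^2
sum(g(x) * dpois(x, mu))      # E{g(X)} computed from the definition
mu^2 + mu                     # E(X^2) = var(X) + E(X)^2 for the Poisson, for comparison
```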

3.1.2.8 Mean, Variance, and Standard Deviation

We have already said that the expectation of the variable itself is called the mean

\[ \mu = E(X) \]
or sometimes, when there is more than one variable under discussion, we decorate the \(\mu\)
\[ \mu_X = E(X). \]

The expected squared deviation from the mean is another important quantity called the variance

\[ \sigma^2 = \operatorname{var}(X) = E\{(X - \mu)^2\} \]
or with decoration
\[ \sigma_X^2 = \operatorname{var}(X) = E\{(X - \mu_X)^2\}. \]

The standard deviation is the square root of the variance, and, conversely, the variance is the square of the standard deviation, always

\[ \operatorname{sd}(X) = \sqrt{\operatorname{var}(X)}, \qquad \operatorname{var}(X) = \operatorname{sd}(X)^2. \]

Why two such closely related concepts? In applications the standard deviation is more useful because it has the same units as the variable. If \(X\) is measured in feet, then \(E(X)\) also has units feet, but \((X - \mu)^2\) and \(\operatorname{var}(X)\) have units square feet (ft\(^2\)), so \(\operatorname{sd}(X)\) is back to units feet (ft). But theoretically, the square root is a nuisance that just makes many formulas a lot messier than they need to be.

Here’s an example. The expectation of a sum is the sum of the expectations, always,

\[ E\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n E(X_i) \]
and the variance of a sum is the sum of the variances, not always, but when the variables are independent,
\[ \operatorname{var}\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n \operatorname{var}(X_i). \]
The latter is a lot messier when one tries to state it in terms of standard deviations
\[ \operatorname{sd}\left( \sum_{i=1}^n X_i \right) = \sqrt{\sum_{i=1}^n \operatorname{sd}(X_i)^2} \]
and standard deviation doesn’t look so simple any more.
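
Here is a minimal R simulation sketch of these two rules (the distributions and sample size are arbitrary choices): the mean of the sum matches the sum of the means, and, because the variables are independent, the variance of the sum matches the sum of the variances.

```r
## Mean and variance of a sum of independent random variables.
set.seed(42)
x1 <- rnorm(100000, mean = 1, sd = 2)   # mean 1, variance 4
x2 <- rpois(100000, lambda = 3)         # mean 3, variance 3
s <- x1 + x2
c(mean(s), 1 + 3)                       # E(X1 + X2) versus E(X1) + E(X2)
c(var(s), 4 + 3)                        # var(X1 + X2) versus var(X1) + var(X2)
```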

3.1.2.9 Operations and Expectations

It is FALSE that a general operation can be taken outside of an expectation. For an arbitrary function \(g\)

\[ E\{g(X)\} \ne g(E(X)). \]
However this is true for some special situations. One can always take addition or subtraction out
\[ E(X + Y) = E(X) + E(Y), \qquad E(X - Y) = E(X) - E(Y). \]
One can always take constants out
\[ E(a X) = a \, E(X), \]
where \(a\) is constant (non-random).

One can always take linear functions out

\[ E(a + b X) = a + b \, E(X), \qquad \operatorname{var}(a + b X) = b^2 \operatorname{var}(X) \]
(the second of these doesn't fit the pattern we are talking about here, but is very important and used a lot).

In the special case of independent random variables, one can take out multiplication and division

\[ E(X Y) = E(X) \, E(Y), \qquad E\left( \frac{X}{Y} \right) = E(X) \, E\left( \frac{1}{Y} \right), \]
but only if \(X\) and \(Y\) are independent random variables (and note that \(E(1/Y)\) is generally not the same as \(1 / E(Y)\)).
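
Here is a minimal R simulation sketch (the exponential distribution and sample size are arbitrary choices): a general function cannot be taken outside an expectation, but multiplication can when the variables are independent.

```r
## E{g(X)} is generally not g(E(X)), but E(XY) = E(X) E(Y) for independent X and Y.
set.seed(42)
x <- rexp(100000)                 # E(X) = 1, E(X^2) = 2 for the standard exponential
c(mean(x^2), mean(x)^2)           # these differ: about 2 versus about 1
y <- rexp(100000)                 # independent of x
c(mean(x * y), mean(x) * mean(y)) # these agree (approximately)
```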

3.1.2.10 Mean and Variance of the Sample Mean

We can use these operations to prove the formulas for the mean and variance of the sample mean in the IID case.
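
A sketch of that calculation, assuming \(X_1, \ldots, X_n\) are IID, each with mean \(\mu\) and variance \(\sigma^2\), and writing \(\bar{X}_n\) for the sample mean: the constant \(1/n\) comes out of the expectation as is and out of the variance squared, and independence lets the variance of the sum split into the sum of the variances,

\[ E(\bar{X}_n) = E\left( \frac{1}{n} \sum_{i=1}^n X_i \right) = \frac{1}{n} \sum_{i=1}^n E(X_i) = \frac{n \mu}{n} = \mu, \]
\[ \operatorname{var}(\bar{X}_n) = \operatorname{var}\left( \frac{1}{n} \sum_{i=1}^n X_i \right) = \frac{1}{n^2} \sum_{i=1}^n \operatorname{var}(X_i) = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}. \]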
