General Statistical Models
The time has come to learn some theory. This is a preview of STAT 5101–5102. We don’t need to learn much theory. We will proceed with the general strategy of all introductory statistics: don’t derive anything, just tell you stuff.
3.1 Probability Models
3.1.1 Kinds of Probability Theory
There are two kinds of probability theory. There is the kind you will learn in STAT 5101-5102 (or 4101–4102, which is more or less the same except for leaving out the multivariable stuff). The material covered goes back hundreds of years, some of it discovered in the early 1600’s. And there is the kind you would learn in MATH 8651–8652 if you could take it, which no undergraduate does (it is a very hard Ph. D. level math course). The material covered goes back to 1933. It is all “new math”. These two kinds of probability can be called classical and measure-theoretic, respectively.
The old kind (classical) is still very useful. For every 1000 people who know a lot of probability theory, 999 know only the classical theory. This includes a lot of working scientists. And it is a lot easier than the new kind (measure-theoretic), which is why we still teach it at the undergraduate and master's level, and to Ph. D. level scientists in all fields. (Only math and stat Ph. D. students take the measure-theoretic probability course.) And Minnesota is no different from any other university in this respect.
3.1.2 Classical Probability Models
In classical probability theory there are two kinds of probability models (also called probability distributions). They are called discrete and continuous. The fact that there are two kinds means everything has to be done twice, once for discrete, once for continuous.
3.1.2.1 Discrete Probability Models
A discrete probability model is specified by a finite or countably infinite set $S$ called the sample space and a real-valued function $f$ on the sample space called the probability mass function (PMF) of the model. A PMF satisfies two properties
$$
f(x) \ge 0, \qquad x \in S,
$$
and
$$
\sum_{x \in S} f(x) = 1.
$$
That should sound familiar, just like what they told you probability was in your intro statistics course. The only difference is that, now that you know calculus, the sample space $S$ can be an infinite set, so the summation here can be an infinite sum.
In principle, the sample space can be any set, but all discrete probability models that are well known, have names, and are used in applications have sample spaces that are subsets of the integers. Here are a few examples.
3.1.2.1.1 The Binomial Distribution
The binomial distribution describes the number of successes in $n$ stochastically independent and identically distributed (IID) repetitions of a random process that can have only two outcomes, conventionally called success and failure, although they could be anything; the important point is that there are only two possible outcomes.
If $p$ is the probability of success in any single trial, then the probability of $x$ successes in $n$ trials is
$$
f(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad x = 0, 1, \ldots, n.
$$
The fact that these probabilities sum to one is a special case of the binomial theorem
$$
(a + b)^n = \sum_{x = 0}^{n} \binom{n}{x} a^x b^{n - x}
$$
with $a = p$ and $b = 1 - p$, so the sum is $\bigl(p + (1 - p)\bigr)^n = 1$.
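Here is a quick numerical sanity check (a sketch; R is assumed as the computing language, which this section does not itself specify), comparing the formula above with R's built-in `dbinom` function and verifying that the probabilities sum to one.

```r
n <- 10
p <- 0.3
x <- 0:n
# evaluate the PMF directly from the formula and with R's built-in dbinom
pmf.formula <- choose(n, x) * p^x * (1 - p)^(n - x)
pmf.builtin <- dbinom(x, n, p)
all.equal(pmf.formula, pmf.builtin)   # TRUE
sum(pmf.builtin)                      # 1, as the binomial theorem says
```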
The binomial distribution is very important. It arises any time there are data with only two outcomes: yes or no, for or against, vanilla or chocolate, whatever.
There is a generalization called the multinomial distribution that allows for any (finite) number of outcomes. But we won’t bother with that. (It is the basis of STAT 5421, categorical data analysis.)
3.1.2.1.2 The Poisson Distribution
The Poisson distribution (named after a man named Poisson, it's not about fish) describes the number of things in any part of a stochastic process where the locations of the things are stochastically independent (none are affected by any of the others). Examples would be the number of winners of a lottery, the number of raisins in a slice of carrot cake, the number of red blood cells in a drop of blood, the number of visible stars in a region of the sky, or the number of traffic accidents in Minneapolis today. It doesn't matter what is counted; so long as the thingummies counted have nothing to do with each other, you get the Poisson distribution.
Its PMF is
$$
f(x) = \frac{\mu^x}{x!} e^{-\mu}, \qquad x = 0, 1, 2, \ldots,
$$
where the parameter $\mu > 0$ is the mean of the distribution.
The fact that these probabilities sum to one is a special case of the Maclaurin series (Taylor series around zero) of the exponential function
$$
e^{\mu} = \sum_{x = 0}^{\infty} \frac{\mu^x}{x!}.
$$
The Poisson distribution was initially derived from the binomial distribution. It is what you get when you let $n$ go to infinity and $p$ go to zero in the binomial PMF in such a way that $np \to \mu$. So the Poisson distribution is an approximation to the binomial distribution when $n$ is very large, $p$ is very small, and $\mu = np$ is moderate sized. This illustrates how one probability distribution can be derived from another.
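A minimal numerical sketch of this approximation in R, using the built-in `dbinom` and `dpois` functions, with $n$ large, $p$ small, and $\mu = np$ moderate:

```r
n <- 1000         # many trials
p <- 0.002        # small success probability
mu <- n * p       # moderate mean, here 2
x <- 0:10
# the two columns agree to several decimal places
round(cbind(binomial = dbinom(x, n, p), poisson = dpois(x, mu)), 6)
```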
3.1.2.1.3 The Zero-Truncated Poisson Distribution
We already met the zero-truncated Poisson distribution. This arises when you have a Poisson distribution except for zero counts. There may be reasons why zero occurs other than Poisson variation: the chef may have forgotten the raisins in the recipe, rather than your slice of carrot cake having no raisins through nothing but chance variation (that's just the way things came out in the mixing of the batter and slicing of the cake).
The zero-truncated Poisson distribution is widely used in aster models, and we used it as an example of a function that requires extreme care if you want to calculate it accurately using computer arithmetic (supplementary notes).
The exact definition is that the zero-truncated Poisson distribution is what you get when you take Poisson data and throw out all the zero counts. So its PMF is the PMF of the Poisson distribution with zero removed from the sample space and all of the probabilities re-adjusted to sum to one.
For the Poisson distribution $\Pr(X = 0) = e^{-\mu}$, so the probability of a nonzero count is $1 - e^{-\mu}$, and the zero-truncated Poisson distribution has PMF
$$
f(x) = \frac{\mu^x e^{-\mu}}{x! \, (1 - e^{-\mu})} = \frac{\mu^x}{x! \, (e^{\mu} - 1)}, \qquad x = 1, 2, 3, \ldots.
$$
This is another illustration of how one probability distribution can be derived from another.
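Here is one way to compute this PMF in R (a sketch, not the code from the supplementary notes; the function name `dtpois` is made up for this illustration). The numerical issue alluded to above is that a naive $1 - e^{-\mu}$ suffers catastrophic cancellation when $\mu$ is small, so we use R's `expm1` function.

```r
# zero-truncated Poisson PMF for x = 1, 2, 3, ...
# -expm1(-mu) computes 1 - exp(-mu) accurately even for tiny mu
dtpois <- function(x, mu) ifelse(x >= 1, dpois(x, mu) / (-expm1(-mu)), 0)

mu <- 0.01
sum(dtpois(1:20, mu))   # very nearly 1, even for tiny mu
```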
3.1.2.2 Univariate Continuous Probability Models
A univariate continuous probability model is specified by a real-valued function $f$ of one real variable called the probability density function (PDF) of the model. A PDF satisfies two properties
$$
f(x) \ge 0, \qquad -\infty < x < \infty,
$$
and
$$
\int_{-\infty}^{\infty} f(x) \, dx = 1.
$$
We say that $f(x) \, dx$ is the probability of an outcome in the interval from $x$ to $x + dx$ when $dx$ is very small. For this to be exactly correct, $dx$ has to be infinitesimal. To get the probability of an outcome in a finite interval, one has to integrate
$$
\Pr(a < X < b) = \int_a^b f(x) \, dx.
$$
That should sound familiar, just like what they told you probability was in your intro statistics course. Integrals are area under a curve. Probability is area under a curve (for continuous distributions).
3.1.2.2.1 The Normal Distribution
The normal distribution arises whenever one averages a large number of IID random variables (with one proviso, which we will discuss later). This is called the central limit theorem (CLT).
Its PDF is
$$
f(x) = \frac{1}{\sqrt{2 \pi} \, \sigma} e^{- (x - \mu)^2 / (2 \sigma^2)}, \qquad -\infty < x < \infty,
$$
where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.
The fact that this integrates to one is something they didn’t teach you in calculus of one variable (because the trick of doing it involves multivariable calculus, in particular, polar coordinates).
The special case when $\mu = 0$ and $\sigma = 1$ is called the standard normal distribution. Its PDF is
$$
f(x) = \frac{1}{\sqrt{2 \pi}} e^{- x^2 / 2}, \qquad -\infty < x < \infty.
$$
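They also didn't teach you how to check it numerically, but R's `integrate` function can do that (a check, not a proof; R is assumed here as in the other examples).

```r
# numerical check that the normal PDF integrates to one
integrate(dnorm, lower = -Inf, upper = Inf)                   # standard normal
integrate(dnorm, lower = -Inf, upper = Inf, mean = 3, sd = 2) # mu = 3, sigma = 2
```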
3.1.2.2.2 The Cauchy Distribution
The Cauchy distribution arises in no applications I know of. It is a mathematical curiosity mostly useful as a counterexample. Many things that are true of other distributions don't hold for the Cauchy. For example, the average of IID Cauchy random variables does not obey the CLT (more on this later). If $X$ and $Y$ are independent standard normal random variables, then $X / Y$ has a Cauchy distribution, but this is an operation that does not seem to arise in applications.
Its PDF is
$$
f(x) = \frac{1}{\pi \sigma \left[ 1 + \left( \frac{x - \mu}{\sigma} \right)^2 \right]}, \qquad -\infty < x < \infty,
$$
where $\mu$ is the center of symmetry (the median) and $\sigma$ is the scale parameter. The special case when $\mu = 0$ and $\sigma = 1$ is called the standard Cauchy distribution.
The fact that these integrate to one involves, firstly, the change-of-variable $y = (x - \mu) / \sigma$ establishing that it is enough to handle the standard case, and, secondly, the fact that the derivative of $\arctan$ is $y \mapsto 1 / (1 + y^2)$, so
$$
\int_{-\infty}^{\infty} \frac{dy}{\pi (1 + y^2)} = \frac{1}{\pi} \Bigl[ \arctan(y) \Bigr]_{-\infty}^{\infty} = \frac{1}{\pi} \left( \frac{\pi}{2} - \left( - \frac{\pi}{2} \right) \right) = 1.
$$
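Two quick numerical illustrations in R (sketches, using the built-in `dcauchy`, `pcauchy`, `rnorm`, and `integrate` functions): first that the standard Cauchy PDF integrates to one, and second that the ratio of independent standard normals does look Cauchy.

```r
# the standard Cauchy PDF integrates to one (numerical check)
integrate(dcauchy, lower = -Inf, upper = Inf)

# the ratio of independent standard normals looks Cauchy
set.seed(42)
z <- rnorm(1e5) / rnorm(1e5)
q <- c(-2, -1, 0, 1, 2)
rbind(empirical = sapply(q, function(a) mean(z <= a)), cauchy = pcauchy(q))
```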
3.1.2.3 Multivariate Continuous Probability Models
A distribution for two or more continuous random variables is the same except that this is multivariable calculus. For example, a probability distribution for three variables $X$, $Y$, and $Z$ has a PDF $f$ that is nonnegative and integrates to one, but now this involves a triple integral
$$
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y, z) \, dx \, dy \, dz = 1.
$$
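For concreteness, here is a small worked example (not from these notes, just an illustration): the function
$$
f(x, y, z) = e^{- x - y - z}, \qquad x > 0, \ y > 0, \ z > 0,
$$
(and zero otherwise) is nonnegative, and it integrates to one because the triple integral factors into three one-dimensional integrals
$$
\int_0^{\infty} \int_0^{\infty} \int_0^{\infty} e^{- x - y - z} \, dx \, dy \, dz
= \left( \int_0^{\infty} e^{- x} \, dx \right)^3 = 1,
$$
so it is a legitimate PDF for three continuous random variables.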
3.1.2.4 Stochastic Independence
We say random variables $X_1$, $X_2$, $\ldots,$ $X_n$ are stochastically independent or statistically independent or independent (with no qualifying adjective) if
the values of any of them have nothing to do with the values of the others (this is the concept we use for applications), or
the PMF or PDF factors into a product of functions of one variable
$$
f(x_1, x_2, \ldots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n)
$$
(this is the theoretical concept). This equation is so important that it has its own terminology: the phrase the joint distribution is the product of the marginal distributions, or even shorter, the joint is the product of the marginals, means the joint distribution of all the variables (the left-hand side of this equation) is equal to the product (on the right-hand side of this equation) of the marginal distributions, meaning $f_i$ is the PDF or PMF, as the case may be, of $X_i$.
So we have two concepts of independence, one applied (that we use to tell us what applications can use this concept) and one theoretical (that we use to tell us how this concept affects the mathematics).
In statistics, we should never use independent with any other meaning to avoid confusion with any other notion of independence. In particular, in regression models we never say dependent and independent variables, but always say predictor and response variable instead.
The theoretical concept implies the applied concept because the PDF or PMF factoring implies that probability calculations will also factor: in the continuous case
$$
\Pr(a_1 < X_1 < b_1 \text{ and } \ldots \text{ and } a_n < X_n < b_n)
= \prod_{i = 1}^{n} \int_{a_i}^{b_i} f_i(x_i) \, dx_i
= \prod_{i = 1}^{n} \Pr(a_i < X_i < b_i).
$$
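A small simulation sketch in R: for two independent standard normal random variables, the probability of a rectangle is the product of the one-dimensional probabilities, as the factorization says.

```r
a1 <- -1; b1 <- 2     # interval for X
a2 <- 0;  b2 <- 1.5   # interval for Y
# exact, using the factorization into marginal probabilities
(pnorm(b1) - pnorm(a1)) * (pnorm(b2) - pnorm(a2))
# simulation check with independent standard normals
set.seed(42)
x <- rnorm(1e6)
y <- rnorm(1e6)
mean(a1 < x & x < b1 & a2 < y & y < b2)
```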
3.1.2.5 Independent and Identically Distributed (IID)
The phrase in the section title, so important that it gets its own TLA (three-letter acronym), is just the special case of independence where all the random variables have the same distribution, so the theoretical concept is
$$
f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n),
$$
where all the marginal distributions are the same function $f$.
3.1.2.6 Introductory Statistics versus Theoretical Statistics
In intro stats the only statistical model discussed is finite population sampling: there are $N$ individuals, which are taken to be fixed not random, a specified population. For example, the population could be the students registered at the University of Minnesota today at 8:00 a. m. (which students the university has changes over time). Then we take a simple random sample (SRS) of this population, which is a special case of IID. The random variables $X_1$, $\ldots,$ $X_n$ are measurements (quantitative or qualitative) on each individual selected for the sample. And SRS means the same as IID: whether one individual is selected for the sample has nothing to do with which other individuals are selected. This means that $X_i$ and $X_j$ can be measurements on the same individual: which individual $X_i$ is a measurement on has nothing to do with which individual $X_j$ is a measurement on, and this means, in particular, that we cannot require that these be different individuals (that would make $X_i$ have something to do with $X_j$). For those who have heard that terminology, we are talking about so-called sampling with replacement as opposed to sampling without replacement.
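In R this is the difference between `sample()` with `replace = TRUE` (the IID version, where the same individual can appear twice) and `replace = FALSE` (the usual without-replacement SRS). A minimal sketch with a made-up five-person population:

```r
population <- c("Alice", "Bob", "Carla", "Dai", "Erin")   # hypothetical individuals
set.seed(42)
sample(population, size = 3, replace = TRUE)    # with replacement: repeats allowed
sample(population, size = 3, replace = FALSE)   # without replacement: all different
```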
To be theoretically astute, you have to move out of finite population sampling and replace SRS with IID. In SRS we are sampling from a finite set (the population), so every variable is discrete whether we think of it that way or not. In IID we can have continuous random variables. But then the SRS story breaks down. Sampling from an infinite population doesn't make sense.
Strangely, statistics teachers and applied statisticians often use the terminology of SRS (the sample and the population) even when they are talking about IID (where those terms don't make any sense, so they are making an imprecise analogy with finite population sampling).
3.1.2.7 Random Variables and Expectation
Applicationally, a random variable is any measurement on a random process. Theoretically, a random variable is a function on the sample space. Either of these definitions makes any function of a random variable or variables another random variable.
If $X$ is the original random variable (taking values in the sample space $S$) and $Y = g(X)$, then the expectation or expected value or mean or mean value (all these terms mean the same thing) of $Y$ is
$$
E(Y) = E\{g(X)\} = \sum_{x \in S} g(x) f(x)
$$
in the discrete case and
$$
E(Y) = E\{g(X)\} = \int_{-\infty}^{\infty} g(x) f(x) \, dx
$$
in the continuous case.
So to calculate the expectation (a. k. a. mean) of a random variable, you multiply the values of the random variable (here $g(x)$) by the corresponding probabilities or probability density (here $f(x)$) and sum or integrate, as the case may be.
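For example, here is a sketch in R computing expectations for the Poisson distribution with mean $\mu = 3.7$ by summation (truncating the infinite sum where the terms become negligible).

```r
mu <- 3.7
x <- 0:100             # far enough out that the omitted tail is negligible
f <- dpois(x, mu)      # the PMF
sum(x * f)             # E(X), here equal to mu = 3.7
sum(x^2 * f)           # E{g(X)} with g(x) = x^2
```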
3.1.2.8 Mean, Variance, and Standard Deviation
We have already said that the expectation of the variable itself is called the mean
$$
\mu = E(X).
$$
The expected squared deviation from the mean is another important quantity called the variance
$$
\sigma^2 = \operatorname{var}(X) = E\{(X - \mu)^2\}.
$$
The standard deviation is the square root of the variance and, conversely, the variance is the square of the standard deviation, always
$$
\operatorname{sd}(X) = \sqrt{\operatorname{var}(X)}, \qquad \operatorname{var}(X) = \operatorname{sd}(X)^2.
$$
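Continuing the Poisson example above (a sketch in R; note these are the population mean, variance, and standard deviation computed by summation, not the sample versions that R's `mean`, `var`, and `sd` functions compute from data):

```r
mu <- 3.7
x <- 0:100
f <- dpois(x, mu)
m <- sum(x * f)              # mean E(X)
v <- sum((x - m)^2 * f)      # variance E{(X - mu)^2}
s <- sqrt(v)                 # standard deviation
c(mean = m, variance = v, sd = s)
```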
Why two such closely related concepts? In applications the standard deviation is more useful because it has the same units as the variable. If $X$ is measured in feet, then $E(X)$ also has units feet, but $(X - \mu)^2$ and $\operatorname{var}(X)$ have units square feet (ft$^2$), so $\operatorname{sd}(X)$ is back to units feet (ft). But theoretically, the square root is a nuisance that just makes many formulas a lot messier than they need to be.
Here's an example. The expectation of a sum is the sum of the expectations, always,
$$
E(X_1 + X_2 + \cdots + X_n) = E(X_1) + E(X_2) + \cdots + E(X_n),
$$
and for independent random variables the variance of a sum is the sum of the variances; the corresponding formula for standard deviations involves squaring and taking square roots and is much messier.
3.1.2.9 Operations and Expectations
It is FALSE that a general operation can be taken outside of an expectation. For an arbitrary function $g$
$$
E\{g(X)\} \neq g\bigl(E(X)\bigr) \qquad \text{in general}.
$$
One can always take linear functions out
$$
E(a + b X) = a + b E(X)
$$
for any constants $a$ and $b$.
In the special case of independent random variables, one can take out multiplication and division
$$
E(X Y) = E(X) E(Y)
$$
and
$$
E(X / Y) = E(X) \, E(1 / Y)
$$
whenever $X$ and $Y$ are independent (note that the second right-hand side involves $E(1 / Y)$, not $1 / E(Y)$).
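A simulation sketch in R illustrating these points, using two independent exponential random variables (an arbitrary choice):

```r
set.seed(42)
x <- rexp(1e6)               # exponential, mean 1
y <- rexp(1e6, rate = 1/2)   # exponential, mean 2, independent of x
mean(x^2); mean(x)^2              # about 2 versus 1: E(X^2) is not E(X)^2
mean(2 + 3 * x); 2 + 3 * mean(x)  # agree: linear functions come out
mean(x * y); mean(x) * mean(y)    # nearly agree: E(XY) = E(X) E(Y) by independence
```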
3.1.2.10 Mean and Variance of the Sample Mean
We can use these operations to prove the formulas for the mean and variance of the sample mean in the IID case. If $X_1$, $\ldots,$ $X_n$ are IID with mean $\mu$ and variance $\sigma^2$, and
$$
\bar{X}_n = \frac{1}{n} \sum_{i = 1}^{n} X_i
$$
is the sample mean, then
$$
E(\bar{X}_n) = \mu \qquad \text{and} \qquad \operatorname{var}(\bar{X}_n) = \frac{\sigma^2}{n}.
$$
The first formula uses only linearity of expectation; the second also uses the fact that for independent random variables the variance of a sum is the sum of the variances.
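And here is a simulation sketch in R checking both formulas: we generate many IID normal samples (the normal distribution is an arbitrary choice), compute the sample mean of each, and compare the mean and variance of those sample means with $\mu$ and $\sigma^2 / n$.

```r
set.seed(42)
n <- 25                        # sample size
mu <- 10; sigma <- 3           # mean and standard deviation of each X_i
nsim <- 1e5                    # number of simulated samples
xbar <- replicate(nsim, mean(rnorm(n, mean = mu, sd = sigma)))
mean(xbar)   # close to mu = 10
var(xbar)    # close to sigma^2 / n = 9 / 25 = 0.36
```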