Consider some process where individuals are randomly sampled from a population, not placed back into the population before the next sample, and every individual in that population has an equal probability of being selected. For the moment, we'll assume that each individual can be one of just two types (e.g. male or female, infected or not infected, defective or not, Conservative or Liberal, pregnant or not).
If the population is very large relative to the sample size, the probability that each sampled individual belongs to a particular category is essentially constant. For example, if we took a sample of size 10 from a set of 1000 bolts, of which 125 were defective, the probability that the first bolt sampled is defective is 125/1000 = 0.125. The probability that the second bolt is defective is 124/999 = 0.124124… if the first bolt was defective, and 125/999 = 0.125125… otherwise. The probability that the tenth bolt is defective lies between 116/991 = 0.117053… (though that scenario has less than a 1 in 100 million chance of occurring) and 125/991 = 0.126135…, with the most likely scenario being 124/991 = 0.125126… In other words, for such small samples the probability does not deviate very significantly from its initial value of 0.125. It is therefore a reasonable approximation to assume that the probability is constant, which makes the sampling a binomial process, and the number of defective bolts in the sample can be estimated using a binomial distribution as:
Defective bolts in sample = Binomial(125/1000, 10) = Binomial(0.125, 10)
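The arithmetic behind this argument is easy to verify directly. A short sketch using only the standard library (the counts 125/1000 are from the bolt example above):

```python
from math import prod

N, K, n = 1000, 125, 10  # population size, defectives in population, sample size

# Probability that the first nine draws are ALL defective (the extreme scenario)
p_all_nine = prod((K - i) / (N - i) for i in range(9))

# Conditional probability that the tenth bolt is defective, by scenario
p_tenth_worst = (K - 9) / (N - 9)   # first nine all defective: 116/991
p_tenth_best = K / (N - 9)          # first nine all fine:      125/991
p_tenth_mode = (K - 1) / (N - 9)    # most likely: one defective so far, 124/991

print(f"P(first nine all defective) = {p_all_nine:.3g}")  # well under 1e-8
print(f"tenth-draw probability range: {p_tenth_worst:.6f} to {p_tenth_best:.6f}")
print(f"most likely value: {p_tenth_mode:.6f}")
```

The spread of possible tenth-draw probabilities is barely a percentage point wide, which is why treating the probability as a constant 0.125 is reasonable here.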
A general rule of thumb (be careful, though, it depends on the level of accuracy you need) is that if the sample is less than 10% of the population, you can use the binomial approximation.
The much more interesting situation we want to get to here is where the sample is of the same order of magnitude as the population. In this situation, it is not accurate to use the binomial approximation. In fact, this is a hypergeometric process and the distribution of defective items is a hypergeometric distribution. So, for example, if we were sampling 25 bolts from a set of 100, where 33 are defective, the distribution would be:
Defective bolts in sample = Hypergeometric(33/100, 25, 100)
The binomial approximation would have been = Binomial(0.33, 25). The figure below shows that the binomial distribution is not sufficiently close to the hypergeometric here, although it was very close for the large-population example above.
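The "sufficiently close" question can also be settled numerically by comparing the two probability mass functions directly. A sketch using exact combinatorics (no simulation), covering both the small-population and large-population cases above:

```python
from math import comb

def hyper_pmf(k, N, K, n):
    """P(k successes) when sampling n without replacement
    from a population of N containing K successes."""
    if k < 0 or k > n or k > K or n - k > N - K:
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Small population: 25 bolts sampled from 100, of which 33 are defective
diff_small = max(abs(hyper_pmf(k, 100, 33, 25) - binom_pmf(k, 25, 0.33))
                 for k in range(26))

# Large population: 10 bolts sampled from 1000, of which 125 are defective
diff_large = max(abs(hyper_pmf(k, 1000, 125, 10) - binom_pmf(k, 10, 0.125))
                 for k in range(11))

print(f"max pmf difference, 25 of 100:  {diff_small:.4f}")   # noticeable
print(f"max pmf difference, 10 of 1000: {diff_large:.4f}")   # negligible
```

The difference is an order of magnitude larger when the sample is a quarter of the population than when it is 1% of it, which is the point of the 10% rule of thumb.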
A couple more examples of the hypergeometric process:
10 out of 45 people in the list are males. If I randomly pick 15 names from that list, how many males would I get?
Answer: = Hypergeometric(10/45, 15, 45)
A manufacturer produces tires for cars. He accidentally mixed 3 defective tires among the lot of 100. How many defective tires would be shipped to the customer from this lot if the total number of tires shipped is 30?
Answer: =Hypergeometric(3/100, 30, 100)
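The tire example can also be checked by brute-force simulation, drawing the 30 shipped tires without replacement from the lot. A sketch using only the standard library:

```python
import random

random.seed(42)
N, K, n = 100, 3, 30                    # lot size, defectives in lot, tires shipped
lot = [True] * K + [False] * (N - K)    # True marks a defective tire

trials = 100_000
total = sum(sum(random.sample(lot, n)) for _ in range(trials))
mean_defective = total / trials

# The hypergeometric mean is n*K/N = 30*3/100 = 0.9 defective tires per shipment
print(f"average defective tires shipped: {mean_defective:.3f}")
```

`random.sample` draws without replacement, which is exactly the hypergeometric sampling process; the simulated average should sit very close to the theoretical mean of 0.9.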
Modeling each sample, or sub-groups of samples, separately
The hypergeometric distribution provides a probability distribution of the total number in the sample that have the characteristic of interest, but does not give us the history of how each individual sample, or groups of samples, turned out. There may be situations where we need to know that.
If we are looking at consecutive samples, we can just nest Hypergeometric distributions. Problems 1 and 2 provide some examples.
If we are interested in the outcome of each consecutive trial, each trial is just a Binomial distribution with n = 1 and p = (number of "defectives" remaining)/(number remaining in the population).
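This per-trial view can be sketched as a loop in which the success probability is updated after every draw. The population counts below are taken from the 25-from-100 bolt example for illustration:

```python
import random

random.seed(1)

def sequential_draws(N, K, n):
    """Draw n items one at a time; each draw is a Binomial(1, p) trial where
    p is the current proportion of 'defectives' remaining, updated per draw."""
    outcomes = []
    remaining_defective, remaining_total = K, N
    for _ in range(n):
        p = remaining_defective / remaining_total   # current trial probability
        defective = random.random() < p             # Binomial(1, p) trial
        outcomes.append(defective)
        remaining_defective -= defective
        remaining_total -= 1
    return outcomes

# The total over n such trials is Hypergeometric(K/N, n, N); check the mean
trials = 50_000
mean_total = sum(sum(sequential_draws(100, 33, 25)) for _ in range(trials)) / trials
print(f"mean defectives in sample: {mean_total:.3f}  (theory: {25 * 33 / 100})")
```

The benefit of this formulation is that the full history of the sampling is available, not just the final total.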
Imagine that we produce specialist power units. We deliver these units to the client in batches of ten. The client has a quality control procedure for each consignment, as follows:
Three units are tested. If two or more of these samples are defective, the consignment is rejected. If one is defective, another three are tested, and if any of this second set is defective the consignment is also rejected. We want to construct a model that looks at the risk of rejection of a consignment for different numbers of defective power units. The model Power Units offers a solution and makes use of the simulation software package's Decision Table tool. To run this tool in the Power Units example, one has to go through the steps explained in the model.
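Independently of the software's Decision Table tool, the rejection probability can be computed exactly by nesting hypergeometric terms, as a check on the simulation model. A sketch (batch of 10, first and second samples of 3 each, per the scheme above):

```python
from math import comb

def comb0(n, k):
    """Binomial coefficient that returns 0 for out-of-range k."""
    return comb(n, k) if 0 <= k <= n else 0

def p_reject(d, batch=10, first=3, second=3):
    """Probability the quality-control scheme rejects a batch
    containing d defective units."""
    def hyper(k, N, K, n):
        return comb0(K, k) * comb0(N - K, n - k) / comb(N, n)

    # Reject outright if the first sample contains 2 or more defectives
    p = sum(hyper(k, batch, d, first) for k in range(2, first + 1))

    # Exactly 1 defective: test `second` more from the remaining units,
    # and reject if any of those are defective
    p1 = hyper(1, batch, d, first)
    if p1 > 0:
        rem_total, rem_def = batch - first, d - 1
        p_second_clean = comb0(rem_total - rem_def, second) / comb(rem_total, second)
        p += p1 * (1 - p_second_clean)
    return p

for d in range(11):
    print(f"{d:2d} defective -> P(reject) = {p_reject(d):.4f}")
```

One mildly surprising output: a batch with a single defective unit is never rejected under this scheme, because one defective in the first sample triggers a second sample that contains no defectives at all.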
The links to the Power Units software-specific models are provided here:
Another, this time rather macabre, example:
A bag contains 100 sweets. 20 sweets contain arsenic, but all the sweets look the same. 20 people take 5 sweets each. You are one of them! A person eating just 1 arsenic sweet has a 50% probability of dying. A person eating 2 or more sweets will certainly die.
a) How many arsenic sweets will you get?
b) What is the probability you will die?
c) How many people will die?
The model Arsenic Sweets offers a solution.
This is a really interesting problem that offers some nice lessons in probability modeling, even if the example is a little far-fetched. It turns out that the probability distribution of the number of arsenic sweets a person gets is the same irrespective of whether they are first or last to take their sweets out of the bag, i.e. Hypergeometric(20/100, 5, 100). You can check this by running simulations on any two of the cells modeling the arsenic sweets a person gets and comparing their distributions.
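The same check can be done outside a spreadsheet in a few lines: the death probability exactly, the symmetry claim by simulation. A sketch (seed and tolerances are illustrative):

```python
import random
from math import comb

random.seed(7)
N, K, per_person, people = 100, 20, 5, 20

# Exact marginal distribution for any one person: Hypergeometric(20/100, 5, 100)
pmf = [comb(K, k) * comb(N - K, per_person - k) / comb(N, per_person)
       for k in range(per_person + 1)]
p_die = 0.5 * pmf[1] + sum(pmf[2:])   # 1 sweet: 50% fatal; 2 or more: certain death
print(f"P(you die) = {p_die:.4f}")
print(f"expected deaths = {people * p_die:.2f} of {people}")

# Symmetry check: the first and the last person to pick have the same distribution
trials = 20_000
first_total = last_total = 0
bag = [True] * K + [False] * (N - K)   # True marks an arsenic sweet
for _ in range(trials):
    random.shuffle(bag)
    first_total += sum(bag[:per_person])    # person 1's five sweets
    last_total += sum(bag[-per_person:])    # person 20's five sweets
print(f"mean arsenic sweets: first {first_total / trials:.3f}, "
      f"last {last_total / trials:.3f}")
```

Both simulated means land at the same theoretical value of 5 × 20/100 = 1 arsenic sweet per person, regardless of picking order.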
That can seem a bit strange – maybe you would have preferred to be the last person to pick your sweets, reasoning that all of the arsenic sweets have probably gone already. It might also seem strange that the last person's distribution, which is a function of what has happened to all of the previous 19 samples, is still the same distribution as the first person.
The intuitive way to look at this is to think of the symmetry of the problem. Imagine all 20 people sitting around a table with a lazy Susan (one of those plates that spin round) in the middle. On the lazy Susan are twenty little dishes, each holding five sweets. Intuitively, you can probably see that it doesn't matter whether we spin the table, or where it ends up: if each person takes the sweets in front of them, the risk is the same.
The problem is also instructive in that we really did have to model each person's distribution of arsenic sweets. We couldn't take a short cut, modeling one person, and extrapolating somehow. This is because the total arsenic sweets has to add up to 20, which means that although each person's marginal distribution for arsenic sweets is the same, in fact this is just one joint distribution with twenty dimensions, one for each person.
The links to the Arsenic Sweets software-specific models are provided here:
More than two different outcomes
So far we have dealt with scenarios where each individual can only take one of two states. However, in many problems, an individual may take several states, for example: Labour, Liberal, Conservative, or Green; not infected, sub-clinically infected, or clinically infected; Caucasian, Asian, African, or Aboriginal; Dell, Compaq, IBM, or Toshiba.
Sampling from a small population now becomes a multivariate hypergeometric process, for which the link provides generating models.
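A multivariate hypergeometric draw can still be simulated with nothing more than sampling without replacement from a labeled population. A sketch below, where the party counts are invented purely for illustration:

```python
import random
from collections import Counter

random.seed(3)
# Hypothetical small population; the counts per party are made up for this sketch
population = (["Labour"] * 40 + ["Conservative"] * 30 +
              ["Liberal"] * 20 + ["Green"] * 10)
n = 25   # sample size: a quarter of the population, so binomial won't do

trials = 20_000
totals = Counter()
for _ in range(trials):
    totals.update(random.sample(population, n))   # one multivariate draw

# Each category's mean count should be n * K_i / N, e.g. Labour: 25*40/100 = 10
for party in ("Labour", "Conservative", "Liberal", "Green"):
    print(f"{party:12s} mean in sample: {totals[party] / trials:.2f}")
```

Each draw yields a full vector of category counts that always sums to the sample size, which is the defining constraint of the multivariate hypergeometric distribution.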