


The InvHypergeo(s, D, M) distribution describes the number of trials needed to achieve s successes, where a trial is a sample without replacement from a population of size M, and where a success is defined as picking one of the D items in that population that have some particular characteristic. So, for example, the total number of animals one needs to select at random to obtain s infected animals from a population of size M, where D of that population are known to be infected, is described by an InvHypergeo(s, D, M) distribution. The probability mass function for the InvHypergeo distribution is a mass of factorial calculations, which is quite laborious to evaluate and leads us to look for suitable approximations.
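To make the "mass of factorial calculations" concrete, the sketch below evaluates the probability mass function directly. It assumes the standard negative hypergeometric form P(N = n) = C(n-1, s-1) C(M-n, D-s) / C(M, D) for n = s, ..., M-D+s, which is not written out on this page; the function name `inv_hypergeo_pmf` and the parameter values are illustrative only.

```python
from math import comb

def inv_hypergeo_pmf(n, s, D, M):
    """P(N == n), where N is the total number of trials to reach s successes.

    Assumed form (standard negative hypergeometric, not quoted from this page):
    choose s-1 of the first n-1 draws to be successes, make draw n the s-th
    success, and place the other D-s marked items among the last M-n draws.
    """
    if n < s or n > M - D + s:
        return 0.0
    return comb(n - 1, s - 1) * comb(M - n, D - s) / comb(M, D)

# Illustrative values: 20 infected animals in a herd of 100, stop at 3 infected.
s, D, M = 3, 20, 100
support = range(s, M - D + s + 1)
total = sum(inv_hypergeo_pmf(n, s, D, M) for n in support)
mean = sum(n * inv_hypergeo_pmf(n, s, D, M) for n in support)

print("total probability:", total)
print("mean from pmf:    ", mean)
print("s(M+1)/(D+1):     ", s * (M + 1) / (D + 1))
```

The probabilities should sum to one, and the mean computed from the pmf should agree with the expected number of trials s(M+1)/(D+1) used in the sections that follow.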



NegBinomial approximation to the InvHypergeo

 

A hypergeometric process is sampling from a finite population without replacement, so that the result of a sample depends on the samples that have gone before it. If the population is very large, so that removing a sample of size n has no discernible effect on the population, then the probability that an individual sample will have the characteristic of interest is essentially constant at D/M. Put another way, were one to replace items after sampling, the probability of drawing the same item again would be very small, so sampling with and without replacement give almost the same results. In such cases, the hypergeometric process can be approximated by a binomial process.


The rule most often quoted is that this approximation works well when n < 0.1 M. The expected number of trials to achieve s successes is given by [s(M+1)/(D+1)]. Recognizing that M and D are large, so M+1 ≈ M and D+1 ≈ D, this is approximately sM/D. Substituting sM/D for n in n < 0.1 M and dividing through by M/D gives the condition:


s < 0.1 D


For a binomial process the total number of trials to achieve s successes, where p is the probability of success, is given by a NegBinomial(p,s), thus:


InvHypergeo(s, D, M) ≈ NegBinomial(D/M,s)                       when s < 0.1 D



Example of a NegBinomial distribution approximation to an InvHypergeometric distribution.
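As a rough numerical check of the rule s < 0.1 D, the sketch below computes the total variation distance between the two distributions for one case that satisfies the rule and one that does not. It assumes the standard negative hypergeometric pmf for InvHypergeo and follows this page's convention that NegBinomial(p, s) counts total trials (many statistical libraries count failures instead). All function names and parameter values are illustrative.

```python
from math import comb

def inv_hypergeo_pmf(n, s, D, M):
    """P(N == n): total trials N to reach s successes without replacement."""
    if n < s or n > M - D + s:
        return 0.0
    return comb(n - 1, s - 1) * comb(M - n, D - s) / comb(M, D)

def neg_binomial_pmf(n, s, p):
    """P(N == n): total trials N to reach s successes with constant probability p."""
    if n < s:
        return 0.0
    return comb(n - 1, s - 1) * p**s * (1 - p) ** (n - s)

def total_variation(s, D, M):
    """Half the summed absolute difference between the two pmfs."""
    p = D / M
    diff = 0.0
    nb_mass = 0.0
    for n in range(s, M - D + s + 1):
        nb = neg_binomial_pmf(n, s, p)
        diff += abs(inv_hypergeo_pmf(n, s, D, M) - nb)
        nb_mass += nb
    # NegBinomial mass beyond the largest possible InvHypergeo value counts in full.
    return 0.5 * (diff + max(0.0, 1.0 - nb_mass))

tv_good = total_variation(s=5, D=200, M=2000)   # s < 0.1 D holds
tv_bad = total_variation(s=50, D=100, M=1000)   # s = 0.5 D breaks the rule
print(f"total variation distance, s < 0.1 D : {tv_good:.4f}")
print(f"total variation distance, s = 0.5 D : {tv_bad:.4f}")
```

The distance should be noticeably smaller in the first case, because drawing only a few of the D marked items barely changes the success probability from one trial to the next.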


Gamma approximation to the InvHypergeo

 

We have just seen how the InvHypergeo distribution can be approximated by the NegBinomial, provided s < 0.1 D, by approximating a hypergeometric process with a binomial process. We have also shown elsewhere that the binomial process can be approximated by the Poisson process, provided n is large and p is small. Thus, provided n is large and D/M is small, a hypergeometric process is well approximated by a Poisson process. In a Poisson process, the Gamma(0,b,a) distribution models the 'time' until observing a events, where b is the mean time between events. The InvHypergeo distribution is the hypergeometric equivalent, modeling the total number of trials to achieve s successes, where [(M-D)/(D+1)] is the mean number of failures per success. The InvHypergeo includes the s successes, whereas the Poisson process does not include successes in the waiting time because each event is assumed to be instantaneous. Therefore, to make the two approaches directly comparable, we have to subtract the number of successes from the InvHypergeo distribution to obtain the number of failures (i.e. we need to shift the InvHypergeo distribution s to the left). In addition, we should think in terms of the mean number of trials per success, equal to [(M-D)/(D+1) + 1] = [(M+1)/(D+1)] ≈ M/D. Then, we can make the following approximation:


InvHypergeo(s, D, M) - s ≈ Gamma(0, (M+1)/(D+1), s)               when s < 0.1 D,  D/M → 0


or


InvHypergeo(s, D, M)     ≈ Gamma(s, (M+1)/(D+1), s)               when s < 0.1 D,  D/M → 0




Two examples of a Gamma approximation to an InvHypergeo. Note that the approximation gets better as s becomes much smaller than D and as D/M gets smaller.
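The sketch below compares the cumulative distribution of InvHypergeo(s, D, M) - s with that of a Gamma with shape s and scale (M+1)/(D+1), once with a small D/M and once with D/M = 0.5. It assumes the standard negative hypergeometric pmf, and it uses the closed-form Erlang CDF, which is valid here only because s is an integer. Function names and parameter values are illustrative.

```python
from math import comb, exp

def inv_hypergeo_pmf(n, s, D, M):
    """P(N == n): total trials N to reach s successes without replacement."""
    if n < s or n > M - D + s:
        return 0.0
    return comb(n - 1, s - 1) * comb(M - n, D - s) / comb(M, D)

def erlang_cdf(x, shape, scale):
    """CDF of a Gamma distribution with integer shape (the Erlang special case)."""
    if x <= 0:
        return 0.0
    z = x / scale
    term, series = 1.0, 1.0          # k = 0 term of sum_{k < shape} z^k / k!
    for k in range(1, shape):
        term *= z / k
        series += term
    return 1.0 - exp(-z) * series

def max_cdf_gap(s, D, M):
    """Largest absolute CDF difference between InvHypergeo - s and the Gamma."""
    scale = (M + 1) / (D + 1)
    cdf, gap = 0.0, 0.0
    for n in range(s, M - D + s + 1):
        cdf += inv_hypergeo_pmf(n, s, D, M)
        # Shift by s so both sides count failures rather than total trials.
        gap = max(gap, abs(cdf - erlang_cdf(n - s, s, scale)))
    return gap

gap_small_p = max_cdf_gap(s=5, D=100, M=20000)   # D/M = 0.005, s < 0.1 D
gap_large_p = max_cdf_gap(s=5, D=100, M=200)     # D/M = 0.5
print(f"max CDF gap, D/M = 0.005: {gap_small_p:.4f}")
print(f"max CDF gap, D/M = 0.5  : {gap_large_p:.4f}")
```

The gap should shrink sharply as D/M falls, in line with the condition D/M → 0 attached to the approximation.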


Normal approximation to the InvHypergeo

 

The Central Limit Theorem states that the sum of a large number (n) of independent, identically distributed random variables will approach a Normal distribution:

 

 

Sum \approx Normal\big(n\mu ,\sigma \sqrt{n}\big)

where μ and σ are the mean and standard deviation of the individual random variables.

 

In a hypergeometric process, the number of trials needed to observe one success is given by InvHypergeo(1,D,M), which has mean and variance given by:

 

 

\mu =\frac{(M+1)}{(D+1)}

 

 

\sigma ^{2}=\frac{(M-D)(M+1)D}{(D+1)^{2}(D+2)}
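The two moment formulas can be checked numerically by summing over the exact distribution. The sketch below does this for one illustrative pair of values, assuming the standard negative hypergeometric pmf for InvHypergeo; the function name and the values of D and M are hypothetical.

```python
from math import comb

def inv_hypergeo_pmf(n, s, D, M):
    """P(N == n): total trials N to reach s successes without replacement."""
    if n < s or n > M - D + s:
        return 0.0
    return comb(n - 1, s - 1) * comb(M - n, D - s) / comb(M, D)

D, M = 20, 100
support = range(1, M - D + 2)            # with s = 1, n runs from 1 to M-D+1
pmf = [inv_hypergeo_pmf(n, 1, D, M) for n in support]

mean = sum(n * p for n, p in zip(support, pmf))
variance = sum((n - mean) ** 2 * p for n, p in zip(support, pmf))

mean_formula = (M + 1) / (D + 1)
variance_formula = (M - D) * (M + 1) * D / ((D + 1) ** 2 * (D + 2))

print("mean:    ", mean, "formula:", mean_formula)
print("variance:", variance, "formula:", variance_formula)
```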

 

For the number of trials between successive successes to have approximately the same distribution, the number of individuals taken from the population must be a very small fraction of that population, i.e.:

 

 

\frac{s(M+1)}{(D+1)}\ll M

 

At the same time, s must be large, so that we are summing a large number of these distributions. Under these conditions, the InvHypergeo distribution can be approximated by a Normal distribution:

 

 

InvHypergeo(s,D,M)\approx Normal\Big(\frac{s(M+1)}{D+1},\sqrt{\frac{s(M-D)(M+1)D}{(D+1)^{2}(D+2)}}\Big)

 

or, more approximately, taking M+1 ≈ M, D+1 ≈ D and D+2 ≈ D:

 

 

InvHypergeo(s,D,M)\approx Normal\Big(\frac{sM}{D},\sqrt{\frac{s(M-D)M}{D^{2}}}\Big)
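The sketch below measures how closely the first Normal approximation tracks the exact cumulative distribution, for a large s and for a small s drawn from the same population. To keep the run time short it builds the exact pmf with a ratio recurrence derived from the standard negative hypergeometric form (an assumption, since this page does not state the pmf), and it applies a half-unit continuity correction, which is also an addition. Function names and parameter values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def inv_hypergeo_pmf_table(s, D, M):
    """Return [(n, P(N == n))] for n = s .. M-D+s, built by recurrence."""
    p = 1.0
    for i in range(s):
        p *= (D - i) / (M - i)           # P(N == s): the first s draws all succeed
    table = [(s, p)]
    for n in range(s, M - D + s):
        # Ratio P(N == n+1) / P(N == n) of the binomial-coefficient pmf.
        p *= n / (n - s + 1) * (M - n - D + s) / (M - n)
        table.append((n + 1, p))
    return table

def normal_cdf_gap(s, D, M):
    """Largest absolute CDF difference between InvHypergeo and its Normal fit."""
    mu = s * (M + 1) / (D + 1)
    sigma = sqrt(s * (M - D) * (M + 1) * D / ((D + 1) ** 2 * (D + 2)))
    approx = NormalDist(mu, sigma)
    cdf, gap = 0.0, 0.0
    for n, p in inv_hypergeo_pmf_table(s, D, M):
        cdf += p
        gap = max(gap, abs(cdf - approx.cdf(n + 0.5)))   # continuity correction
    return gap

gap_large_s = normal_cdf_gap(s=50, D=1000, M=10000)   # many summed terms, s << D
gap_small_s = normal_cdf_gap(s=2, D=1000, M=10000)    # too few terms for the CLT
print(f"max CDF gap, s = 50: {gap_large_s:.4f}")
print(f"max CDF gap, s = 2 : {gap_small_s:.4f}")
```

The gap should be clearly larger for s = 2, where the distribution is still strongly right-skewed and the Normal places visible probability below the minimum possible value of s trials.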

 

 

 

 

 

