

MultiHypergeo(n,{Di})

Multivariate hypergeometric equations



The Multivariate Hypergeometric distribution is an extension of the Hypergeometric distribution to groups whose individuals can be in more than two different states.


Example

In a group of 50 people, of whom 20 were male, a Hypergeometric(20/50,10,50) would describe how many from ten randomly chosen people would be male (and by deduction how many would therefore be female). However, let's say we have a group of 10 people as follows:


German   English   French   Canadian
3        2         1        4


Now let's take a sample of 4 people at random from this group. We could have various numbers of each nationality in our sample:


German   English   French   Canadian
3        1         0        0
3        0         1        0
3        0         0        1
2        2         0        0
2        1         1        0
2        1         0        1
2        0         1        1
2        0         0        2
...      ...       ...      ...

Etc.





and each combination has a certain probability. The Multivariate Hypergeometric distribution is an array distribution, in this case generating simultaneously four numbers, that returns how many individuals in the random sample came from each sub-group (e.g. German, English, French, and Canadian).
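As an illustration of how each combination gets its probability, the following minimal sketch lists every one of the equally likely ways of choosing 4 people from the group of 10 and tallies how many of them produce each mix of nationalities. The variable names (`people`, `order`, `tally`, `probabilities`) are chosen here for illustration only; brute-force enumeration is practical only because the group is so small.

```python
from collections import Counter
from itertools import combinations

# The group of 10 people from the example above.
people = ["German"] * 3 + ["English"] * 2 + ["French"] * 1 + ["Canadian"] * 4
order = ["German", "English", "French", "Canadian"]

# Every set of 4 distinct people is equally likely, so count how many
# sets give each (German, English, French, Canadian) combination.
tally = Counter()
for sample in combinations(range(len(people)), 4):
    counts = Counter(people[i] for i in sample)
    tally[tuple(counts[name] for name in order)] += 1

total = sum(tally.values())
probabilities = {combo: ways / total for combo, ways in tally.items()}

# Print the combinations in the same order as the table (most Germans first).
for combo, prob in sorted(probabilities.items(), reverse=True):
    print(combo, f"{prob:.4f}")
```

Combinations that need more people of a nationality than the group contains (two French people, for instance) never appear in the tally, so they have probability zero.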


Generation

The Multivariate Hypergeometric distribution is created by extending the mathematics of the Hypergeometric distribution. For the Hypergeometric distribution with a sample of size n, the probability of observing x individuals from a sub-group of size D, and therefore (n-x) from the remaining (M-D) individuals, is given by the probability distribution for x:



f(x)=\frac{\left( \begin{array}{c} D \\ x \end{array} \right) \left( \begin{array}{c} M-D \\ n-x \end{array} \right)}{\left( \begin{array}{c} M \\ n \end{array} \right) }


where M is the group size, and D is the size of the sub-group of interest. The numerator is the number of different sampling combinations (each of which has the same probability, because each individual has the same probability of being sampled) that contain exactly x individuals from the sub-group of size D and, by implication, (n-x) from the remaining (M-D) individuals. The denominator is the total number of different combinations of individuals one could obtain when selecting n individuals from a group of size M. Thus the equation is just the proportion of the possible scenarios, each of which has the same probability, that give us x individuals from D.
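The counting argument can be checked numerically. The sketch below evaluates the formula for the earlier example of 50 people of whom 20 are male, with 10 people sampled. The function name `hypergeometric_pmf` and its argument order are assumptions made for this illustration and do not correspond to any particular software's function signature.

```python
from math import comb

def hypergeometric_pmf(x, n, D, M):
    """Probability of x individuals from a sub-group of size D
    in a sample of n taken from a group of size M."""
    # Outcomes that cannot occur have probability zero.
    if x < 0 or x > D or n - x < 0 or n - x > M - D:
        return 0.0
    # Favourable combinations divided by all possible combinations.
    return comb(D, x) * comb(M - D, n - x) / comb(M, n)

# 50 people, 20 of them male, 10 sampled at random.
pmf = [hypergeometric_pmf(x, n=10, D=20, M=50) for x in range(11)]
mean_males = sum(x * p for x, p in enumerate(pmf))

for x, p in enumerate(pmf):
    print(x, f"{p:.4f}")
```

The probabilities sum to one, and the mean equals n·D/M, which is 4 males for this example.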


The Multivariate Hypergeometric probability equation is just an extension of this idea. The figure below shows the graphical representation of the multivariate hypergeometric process: D1, D2, D3 and so on are the number of individuals of different types in a population, and x1, x2, x3, ... are the number of successes (the number of individuals in our random sample (circled) belonging to each category).


 



This results in the joint probability distribution for {x} = {x1, x2, ..., xk}:



f(x)=\frac{\left( \begin{array}{c} D_1 \\ x_1 \end{array} \right) \left( \begin{array}{c} D_2 \\ x_2 \end{array} \right) \dots \left( \begin{array}{c} D_k \\ x_k \end{array} \right) }{\left( \begin{array}{c} M \\ n \end{array} \right)}


where \displaystyle\sum_{i=1}^{k} D_i=M,\displaystyle\sum_{i=1}^{k} x_i=n
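A minimal sketch of this joint probability follows, applied to the nationality example (3 German, 2 English, 1 French, 4 Canadian, sample of 4). The function name `multi_hypergeo_pmf` is hypothetical and chosen for this illustration; the sample size n and the group size M are inferred from the sums of the two input lists, as in the constraints above.

```python
from math import comb, prod

def multi_hypergeo_pmf(x, D):
    """Joint probability of drawing x[i] individuals of each type i
    when the population holds D[i] individuals of that type."""
    if len(x) != len(D):
        raise ValueError("x and D must have the same length")
    # A sample cannot hold more individuals of a type than exist.
    if any(xi < 0 or xi > Di for xi, Di in zip(x, D)):
        return 0.0
    n = sum(x)  # sample size
    M = sum(D)  # population size
    return prod(comb(Di, xi) for xi, Di in zip(x, D)) / comb(M, n)

nationalities = (3, 2, 1, 4)  # German, English, French, Canadian
p_three_germans_one_english = multi_hypergeo_pmf((3, 1, 0, 0), nationalities)
p_two_french = multi_hypergeo_pmf((2, 0, 2, 0), nationalities)

print(p_three_germans_one_english)
print(p_two_french)
```

With only two types, the function reduces to the ordinary Hypergeometric probability, which is a convenient sanity check.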


Example and generation

Let's imagine a problem where we have 100 colored balls in a bag, of which 10 are red, 15 purple, 20 blue, 25 green and 30 yellow. Without looking into the bag, you take 30 balls out. How many balls of each color will you take from the bag?


We cannot model this problem using the multinomial distribution, because when we take the first ball out, the proportions of the different color balls in the bag change. The same happens when we take the second ball out and so on.
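The change in proportions can be made concrete with exact fractions. This small sketch, with names (`bag`, `proportions`) chosen purely for illustration, compares the color proportions before any ball is taken with those after one red ball has been removed.

```python
from fractions import Fraction

bag = {"red": 10, "purple": 15, "blue": 20, "green": 25, "yellow": 30}

def proportions(contents):
    """Exact proportion of each color among the balls in the bag."""
    total = sum(contents.values())
    return {color: Fraction(count, total) for color, count in contents.items()}

before = proportions(bag)
# Suppose the first ball drawn is red: one fewer red, 99 balls left.
after_red = proportions({**bag, "red": bag["red"] - 1})

for color in bag:
    print(color, before[color], after_red[color])
```

The probability of red falls from 1/10 to 1/11, while the probability of every other color rises. A multinomial model would keep all of these probabilities fixed from draw to draw.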


Thus, we must proceed as follows:


  • Model the first color (red, for example) as x1 = Hypergeometric(D1/M, n, M), where n is the sample size = 30, D1 is the total number of red balls in the bag = 10, and M is the population size = 100.

  • Model the rest as xi = Hypergeometric(Di/SUM(Di:Dk), n - SUM(x1:xi-1), SUM(Di:Dk)), where xi is the number of individuals of type i in the sample, Di is the number of individuals of type i in the total population, and Dk is the number of individuals of the last type in the total population. Here SUM(Di:Dk) is the population that remains once types 1 to i-1 are set aside, and n - SUM(x1:xi-1) is the part of the sample not yet allocated to an earlier type.
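The two steps above can be sketched as a chain of conditional Hypergeometric draws. This is a minimal illustration using only the Python standard library, not the implementation used in any particular software package. The helper names `hypergeometric_draw` and `multi_hypergeo_draw` are hypothetical, and the single Hypergeometric draw is simulated by direct sampling without replacement, which is simple rather than efficient.

```python
import random

def hypergeometric_draw(rng, sample_size, D, M):
    """Count marked individuals in a sample taken without replacement
    from M individuals, of which the first D are treated as marked."""
    return sum(1 for i in rng.sample(range(M), sample_size) if i < D)

def multi_hypergeo_draw(rng, sample_size, D):
    """Draw the number of individuals of each type, one type at a time."""
    remaining_pop = sum(D)
    if sample_size > remaining_pop:
        raise ValueError("sample size exceeds population size")
    remaining_sample = sample_size
    x = []
    for Di in D:
        # Type i competes only with the types not yet handled.
        xi = hypergeometric_draw(rng, remaining_sample, Di, remaining_pop)
        x.append(xi)
        remaining_sample -= xi  # sample slots still to be filled
        remaining_pop -= Di     # population of the later types
    return x

rng = random.Random(2024)          # fixed seed, for repeatability only
balls = [10, 15, 20, 25, 30]       # red, purple, blue, green, yellow
draw = multi_hypergeo_draw(rng, 30, balls)
print(draw)
```

Each call returns one possible answer to the question of how many balls of each color are taken. The counts always add up to 30 and never exceed the number of balls of that color in the bag.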


The solution to this problem can be reached in the model: Multivariate Hypergeometric

The links to the Multivariate Hypergeometric software specific models are provided here:



