To learn more about EpiX Analytics' work, please visit our modeling applications, white papers, and training schedule.

Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: fixed link



 

 

We sometimes need to recognize the inter-relationship between probabilities of values for two or more distributions.

...

In some simulation software packages, specific features, such as Rank Order Correlation in Crystal Ball, and other modeling methods allow us to crudely model correlations between several distributions. However, there are certain situations where specific multinomial distributions are needed. 


The following three common multivariate distributions are described here:

  • Multinomial

  • Dirichlet

  • and Multivariate Hypergeometric

 


Multinomial distribution

 


For a set of n trials, each of which could take one of k different outcomes (like different colours of balls in a huge urn) with probabilities p1..pk, the distribution of the outcomes is known as multinomial, which is just an extension of the binomial distribution. The only difference is the number of possible outcomes: only two for the binomial and multiple for the multinomial. The Multinomial distribution has the following probability mass function:

...

LaTeX Math Block
alignmentleft
where \; \displaystyle\sum_{i=1}^{k} s_i = n  \quad and  \quad \displaystyle\sum_{i-1}^{k}p_i =1


 

 
LaTeX Math Inline
body\left( \begin{array}{c} n \\ S_1 ~ ~ S_2 ~ ~ ... S_k \end{array} \right) = \frac{\bigg(\displaystyle\sum_{i=1}^{k} s_i\bigg) !}{s_1!s_2!...s_k!}
                    is sometimes known as Multinomial coefficient.

 


Let's consider the following problem: The cars in a city are divided into 9 different categories. We know the proportion of the city's cars that are in each category. If we were to monitor 1000 cars that enter a particular motorway, how many cars of each category would we see?

 


This is clearly a problem of multinomial trials since every car that enters the motorway can be any one of 9 types.

...

Note that the marginal distribution (the distribution of the parameter looked at in isolation, i.e. with the uncertainty of all the other parameters integrated out) for sj (i.e. the distribution of generated values for sj when looked at by itself) is simply a binomial(n,pj). 


An example of a multinomial distribution can be found in the Multinomial model.

...

Expand
titleCrystal Ball

  Multinomial

Here is a screenshot of the model:

 


 


Our first category is simulated in cell C14, which is just a binomial distribution : Binomial(5%, 1000)

As the second category now needs to take into account the result from the first type, the formula in cell D14 becomes as shown above - number of trials is decreased by the number of successes from the first category (see D13), and the probability of success becomes the probability of category two divided by the sum of the probabilities of the remaining 8 categories (see D12). This logic is consistent throughout the "Successes" row (cells D15: K15).

...

Expand
title@Risk

  Multinomial

 Here is a screenshot of the model:

 


 


Our first category is simulated in cell C11, which is just a binomial distribution : RiskBinomial(1000, 5%)

As the second category now needs to take into account the result from the first type, the formula in cell D11 becomes as shown above - number of trials is decreased by the number of successes from the first category, and the probability of success becomes the probability of category two divided by the sum of the probabilities of the remaining 8 categories.

This logic is consistent throughout the "Successes" row (cells D11: K11), and the row "Outputs" shows a nice way of naming the output cells.



Dirichlet distribution 


The conjugate to the multinomial distribution is the Dirichlet distribution, much like the beta distribution is the conjugate to the binomial distribution. The Dirichlet distribution is used for modeling the uncertainty around probabilities of successes in multinomial trials.

...

LaTeX Math Block
alignmentleft
f(p_1,p_2,...p_k)=\frac{\Gamma\big(\alpha_1+a_2+...+a_k\big)}{\Gamma\big(a_1\big) \Gamma\big(a_2\big) ...\Gamma \big(a_k\big)} p_1^{\;\alpha_{1^{-1}}}...p_k^{\;a_{k^{-1}}}

...


For example, if you've observed s1, s2, … sk of different types of outcomes from n trials, the Dirichlet distribution provides the confidence distribution about the correct values for the probability that a random trial will produce each type of outcome by setting a1 = s1+1,. Obviously these probabilities have to sum to 1, so their uncertainty distributions are inter-related.

 


Let's take the same problem that we used in the previous example: All cars in a city are divided into 9 different types. But now we have monitored 100 cars that were entering a particular motorway, and counted the number of cars of each type. What is the uncertainty distribution for the proportions of each type in the total population of cars?

 


In order to solve this problem we need to go through the following steps: 


  • Simulate p1 = Beta(a1, SUM(a2..ak),1)

  • For each remaining probability (except pk), simulate p2, p3, …,, pk-1 in order with
    pj = (1-SUM(p1..pj-1))*Beta(aj, SUM(aj+1..ak),1)

  • Finally set pk = 1-SUM(p1..pk-1)

...

Expand
titleCrystal Ball

  Dirichlet

Putting the above logic into a spreadsheet model looks like this: 


 


Row "Alphas" (Cells C10:K10) is calculating the values for aj , which are just sj+1

Cell C12 is calculating the confidence distribution for the proportion of category 1 cars using Crystal Ball's Beta distribution with the first parameter equal to a1 and the second parameter equal to the summation of a2 to a9.

The second part of the equation in cells D13 to J13 follow similar logic as C13, which is then multiplied by the (1 - sum of the previous cells in the same row). The last Cell K13 calculates the implied probability for the last category as 1-sum(C13:J13).

...

Expand
title@Risk

  Dirichlet

Putting the above logic into a spreadsheet model looks like this: 


 


Row "Alphas" (Cells C9:K9) is calculating the values for aj , which are just sj+1

Cell C10 is calculating the confidence distribution for the proportion of category 1 cars using a RiskBeta function with the first parameter equal to a1 and the second parameter equal to the summation of a2 to a9.

The second part of the equation in cells D10 to J10 follow similar logic as C10, which is then multiplied by the (1 - sum of the previous cells in the same row). The last Cell K10 calculates the implied probability for the last category as 1-sum(C10:J10).

 


The Dirichlet distribution is not as intuitive as the Multinomial distribution, but it is a very handy tool when modeling multinomial trials.

 

 



Multivariate hypergeometric

 


Sometimes we need to model sampling from a population without replacement with multiple outcomes and when the population is small so the process cannot be approximated to a multinomial where the probabilities of success remain constant. In this case we use the multivariate hypergeometric distribution, which is similar to the hypergeometric distribution, with the difference in the number of possible outcomes from a trial (two - in the hypergeometric and many - in the multivariate hypergeometric).

 


The figure below shows the graphical representation of the multivariate hypergeometric process. D1, D2, D3 and so on are the number of individuals of different types in a population, and x1, x2, x3, ... are the number of successes (the number of individuals in our random sample (circled) belonging to each category). 


 


The Multivariate hypergeometric distribution has the following probability mass function:

 


LaTeX Math Block
alignmentleft
f(x)=\frac {\left(
    \begin{array}{c}
      D_1 \\
    x_1
    \end{array}
	 \right)\left(
    \begin{array}{c}
      D_2 \\
    x_2
    \end{array}
	 \right)...\left(
    \begin{array}{c}
      D_k \\
    x_k
    \end{array}
	 \right)}{\left(
    \begin{array}{c}
     M \\
    n
    \end{array}
	 \right)}

, where 

LaTeX Math Inline
body\displaystyle\sum_{i=1}^k D_i = M, \displaystyle\sum_{i=1}^k x_i = n

 


Let's imagine a problem where we have 100 colored balls in a bag, from which 10 are red, 15 purple, 20 blue, 25 green and 30 yellow. Without looking into the bag, you take 30 balls out. How many balls of each color will you take from the bag? 


We cannot model this problem using the multivariate distribution, because when we take the first ball out, the proportions of the different color balls in the bag change. The same happens when we take the second ball out and so on. 


Thus, we must proceed as follows:

 


  • Model the first color (red for example) as x1= Hypergeometric(s1 ,N, M) , where s1 is the the type 1 population = 10, N is the sample size = 30, and M is the population size = 100
     

  • Model the rest as: xi = Hypergeometric (si, N - SUM(x1: xi-1), SUM(Di : Dn)) , where si is the the type i population, xi is the number of successes of the type i in a sample, xi-1 is the number of successes of the type i-1 in a sample, Di is the number of successes of type i in the total population, Dn is the number of successes of the last type in the total population.

 


The solution to this problem can be reached here: Multivariate Hypergeometric

...

Expand
title@Risk

  Multivariate_hypergeometric

 

...