We sometimes need to recognize the inter-relationship between the values generated by two or more distributions.

In other words, these distributions are not independent of each other.

In some simulation software packages, features such as Rank Order Correlation in Crystal Ball, together with other modeling methods, allow us to model correlations between several distributions in an approximate way. However, there are certain situations where specific multivariate distributions are needed.


The following three common multivariate distributions are described here:

  • Multinomial

  • Dirichlet

  • Multivariate Hypergeometric


Multinomial distribution


For a set of n trials, each of which could take one of k different outcomes (like different colours of balls in a huge urn) with probabilities p1..pk, the joint distribution of the outcome counts is known as the multinomial distribution, which is just an extension of the binomial distribution. The only difference is the number of possible outcomes: two for the binomial and more than two for the multinomial. The Multinomial distribution has the following probability mass function:

p(s_1,s_2,...,s_k) = \left( \begin{array}{c} n \\ {s_1 ~ ~ s_2 ~ ~ ... ~ ~s_k} \end{array} \right) p_1^{s_1} p_2^{s_2}...p_k^{s_k}

where \; \displaystyle\sum_{i=1}^{k} s_i = n \quad and \quad \displaystyle\sum_{i=1}^{k}p_i =1


  \left( \begin{array}{c} n \\ {s_1 ~ ~ s_2 ~ ~ ... ~ ~s_k} \end{array} \right) = \frac{n!}{s_1!s_2!...s_k!}                    is known as the multinomial coefficient.
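
For readers who want to check the formula numerically, here is a minimal Python sketch (not part of the original spreadsheet models; the helper name and the example numbers are purely illustrative) that evaluates the probability mass function directly from the definition above:

  from math import factorial, prod

  def multinomial_pmf(s, p):
      """P(s_1, ..., s_k) for n = sum(s) trials with outcome probabilities p_1..p_k."""
      n = sum(s)
      coeff = factorial(n)              # multinomial coefficient n! / (s_1! s_2! ... s_k!)
      for si in s:
          coeff //= factorial(si)       # exact integer division at every step
      return coeff * prod(pi ** si for pi, si in zip(p, s))

  # Example: 3 trials, three outcomes with probabilities 0.5, 0.3, 0.2
  print(multinomial_pmf([1, 1, 1], [0.5, 0.3, 0.2]))   # 6 * 0.5 * 0.3 * 0.2 = 0.18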


Let's consider the following problem: The cars in a city are divided into 9 different categories. We know the proportion of the city's cars that are in each category. If we were to monitor 1000 cars that enter a particular motorway, how many cars of each category would we see?


This is clearly a problem of multinomial trials since every car that enters the motorway can be any one of 9 types.

To sample from a multinomial distribution we need to proceed as follows:

We know p1, p2,…,pk (proportions of each type) and n (our sample size - 1000).

First we simulate from binomial(n, p1) – this gives us s1.

For each remaining category, we simulate s2, s3, …, sk in order with sj = Binomial(n - SUM(s1…sj-1), pj / SUM(pj…pk))

Note that the marginal distribution for sj (i.e. the distribution of the generated values of sj looked at in isolation, with the variation in all the other counts integrated out) is simply a Binomial(n, pj).
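
As a cross-check outside the spreadsheet, the following Python/NumPy sketch implements the same sequential-binomial construction. The 5% first category matches the model below; the remaining eight proportions, and the helper name, are illustrative assumptions only.

  import numpy as np

  rng = np.random.default_rng()

  def multinomial_sample(n, p):
      """Sample category counts s_1..s_k using the sequential binomial scheme above.
      Assumes the probabilities in p are positive and sum to 1."""
      counts = []
      remaining_trials = n
      remaining_prob = 1.0                      # SUM(pj..pk) for the categories not yet sampled
      for pj in p[:-1]:
          prob = min(1.0, pj / remaining_prob)  # guard against floating-point round-off
          sj = rng.binomial(remaining_trials, prob)
          counts.append(sj)
          remaining_trials -= sj
          remaining_prob -= pj
      counts.append(remaining_trials)           # the last category takes whatever is left
      return counts

  # 1000 cars split across 9 categories
  p = [0.05, 0.10, 0.10, 0.10, 0.15, 0.15, 0.15, 0.10, 0.10]
  print(multinomial_sample(1000, p))
  # NumPy's built-in equivalent: rng.multinomial(1000, p)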


An example of a multinomial distribution can be found in the Multinomial model.

The links to the software-specific Multinomial models are provided here:

  Multinomial (Crystal Ball)

Here is a screenshot of the model:



Our first category is simulated in cell C14, which is just a binomial distribution: Binomial(5%, 1000)

As the second category now needs to take into account the result from the first one, the formula in cell D14 becomes as shown above: the number of trials is decreased by the number of successes from the first category (see D13), and the probability of success becomes the probability of category two divided by the sum of the probabilities of the remaining 8 categories (see D12). This logic is consistent throughout the "Successes" row (cells D14:K14).

  Multinomial (@RISK)

 Here is a screenshot of the model:



Our first category is simulated in cell C11, which is just a binomial distribution: RiskBinomial(1000, 5%)

As the second category now needs to take into account the result from the first type, the formula in cell D11 becomes as shown above - number of trials is decreased by the number of successes from the first category, and the probability of success becomes the probability of category two divided by the sum of the probabilities of the remaining 8 categories.

This logic is consistent throughout the "Successes" row (cells D11:K11), and the "Outputs" row shows a nice way of naming the output cells.



Dirichlet distribution


The conjugate to the multinomial distribution is the Dirichlet distribution, much like the beta distribution is the conjugate to the binomial distribution. The Dirichlet distribution is used for modeling the uncertainty around probabilities of successes in multinomial trials.

The Dirichlet distribution has the following probability density function:

f(p_1,p_2,...,p_k)=\frac{\Gamma\big(\alpha_1+\alpha_2+...+\alpha_k\big)}{\Gamma\big(\alpha_1\big) \Gamma\big(\alpha_2\big) ...\Gamma \big(\alpha_k\big)} \; p_1^{\;\alpha_1-1}...p_k^{\;\alpha_k-1}


For example, if you've observed s1, s2, …, sk outcomes of the different types from n trials, the Dirichlet distribution provides the confidence distribution for the probability that a random trial will produce each type of outcome, by setting ai = si + 1 for each outcome type. Obviously these probabilities have to sum to 1, so their uncertainty distributions are inter-related.
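
As an illustration (the observed counts below are made-up numbers, not taken from the model), this conjugate update can be written in a few lines of Python using NumPy's built-in Dirichlet generator:

  import numpy as np

  rng = np.random.default_rng()

  # Hypothetical observed counts for 9 categories from n = 100 trials
  s = np.array([5, 10, 10, 10, 15, 15, 15, 10, 10])
  alphas = s + 1                  # a_i = s_i + 1

  p = rng.dirichlet(alphas)       # one joint draw of the uncertain proportions
  print(p, p.sum())               # the sampled proportions always sum to 1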


Let's take the same problem that we used in the previous example: All cars in a city are divided into 9 different types. But now we have monitored 100 cars that were entering a particular motorway, and counted the number of cars of each type. What is the uncertainty distribution for the proportions of each type in the total population of cars?


In order to solve this problem we need to go through the following steps:


  • Simulate p1 = Beta(a1, SUM(a2..ak),1)

  • For each remaining probability (except pk), simulate p2, p3, …, pk-1 in order with
    pj = (1-SUM(p1..pj-1))*Beta(aj, SUM(aj+1..ak),1)

  • Finally set pk = 1-SUM(p1..pk-1)

Note that the marginal distribution of pj is Beta(sj+1, n+k-sj-1, 1), which means that if we have no information (i.e. n = 0 and every sj = 0), the prior for each pj is Beta(1, k-1, 1), which has a mean of 1/k but is not a Uniform distribution. That makes sense because the probabilities must add up to 1, so the means of their uncertainty distributions must also add up to 1.
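
The same construction can be sketched in Python with NumPy's standard two-parameter Beta (the third parameter in the spreadsheet notation above is just the scale of 1); the helper name and the example counts are illustrative assumptions:

  import numpy as np

  rng = np.random.default_rng()

  def dirichlet_sample(alphas):
      """Sample p_1..p_k by the sequential Beta ("stick-breaking") steps listed above."""
      alphas = np.asarray(alphas, dtype=float)
      k = len(alphas)
      p = np.zeros(k)
      remaining = 1.0                                        # 1 - SUM(p1..pj-1)
      for j in range(k - 1):
          p[j] = remaining * rng.beta(alphas[j], alphas[j + 1:].sum())
          remaining -= p[j]
      p[-1] = remaining                                      # pk = 1 - SUM(p1..pk-1)
      return p

  s = np.array([5, 10, 10, 10, 15, 15, 15, 10, 10])          # illustrative counts, n = 100
  print(dirichlet_sample(s + 1))

Each p[j] generated this way has the Beta(sj+1, n+k-sj-1) marginal noted above.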

The solution to this problem can be found in the Dirichlet model:

The links to the software-specific Dirichlet models are provided here:

  Dirichlet (Crystal Ball)

Putting the above logic into a spreadsheet model looks like this:



Row "Alphas" (Cells C10:K10) is calculating the values for aj , which are just sj+1

Cell C12 calculates the confidence distribution for the proportion of category 1 cars using Crystal Ball's Beta distribution, with the first parameter equal to a1 and the second parameter equal to the sum of a2 to a9.

The formulas in cells D13 to J13 follow similar logic to C13, except that the Beta distribution is multiplied by (1 - the sum of the previous cells in the same row). The last cell, K13, calculates the implied probability for the last category as 1 - SUM(C13:J13).

  Dirichlet (@RISK)

Putting the above logic into a spreadsheet model looks like this:



Row "Alphas" (Cells C9:K9) is calculating the values for aj , which are just sj+1

Cell C10 calculates the confidence distribution for the proportion of category 1 cars using a RiskBeta function, with the first parameter equal to a1 and the second parameter equal to the sum of a2 to a9.

The formulas in cells D10 to J10 follow similar logic to C10, except that the Beta distribution is multiplied by (1 - the sum of the previous cells in the same row). The last cell, K10, calculates the implied probability for the last category as 1 - SUM(C10:J10).


The Dirichlet distribution is not as intuitive as the Multinomial distribution, but it is a very handy tool when modeling the uncertainty about the probabilities in multinomial trials.



Multivariate hypergeometric distribution


Sometimes we need to model sampling without replacement from a population with multiple outcome types, where the population is small enough that the process cannot be approximated by a multinomial (in which the probabilities of success stay constant). In this case we use the multivariate hypergeometric distribution, which is similar to the hypergeometric distribution, the difference being the number of possible outcomes of a trial (two for the hypergeometric, many for the multivariate hypergeometric).


The figure below shows a graphical representation of the multivariate hypergeometric process. D1, D2, D3, and so on are the numbers of individuals of each type in the population, and x1, x2, x3, ... are the numbers of successes, i.e. the numbers of individuals in our random sample (circled) that belong to each category.



The Multivariate hypergeometric distribution has the following probability mass function:


f(x)=\frac {\left( \begin{array}{c} D_1 \\ x_1 \end{array} \right)\left( \begin{array}{c} D_2 \\ x_2 \end{array} \right)...\left( \begin{array}{c} D_k \\ x_k \end{array} \right)}{\left( \begin{array}{c} M \\ n \end{array} \right)}

where \; \displaystyle\sum_{i=1}^k D_i = M \quad and \quad \displaystyle\sum_{i=1}^k x_i = n
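
As a quick numerical check of this formula (using Python's standard-library combinatorics; the helper name and the particular outcome evaluated are arbitrary illustrations):

  from math import comb, prod

  def mv_hypergeom_pmf(x, D):
      """P(x_1, ..., x_k) when drawing n = sum(x) items without replacement
      from a population with D_i items of each type (M = sum(D))."""
      return prod(comb(Di, xi) for Di, xi in zip(D, x)) / comb(sum(D), sum(x))

  # Bag from the example below: 10 red, 15 purple, 20 blue, 25 green, 30 yellow; draw 30
  D = [10, 15, 20, 25, 30]
  print(mv_hypergeom_pmf([3, 5, 6, 7, 9], D))   # probability of this particular split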


Let's imagine a problem where we have 100 colored balls in a bag, of which 10 are red, 15 purple, 20 blue, 25 green and 30 yellow. Without looking into the bag, you take 30 balls out. How many balls of each color will you take from the bag?


We cannot model this problem using the multinomial distribution, because when we take the first ball out, the proportions of the different color balls in the bag change. The same happens when we take the second ball out, and so on.


Thus, we must proceed as follows:


  • Model the first color (red for example) as x1 = Hypergeometric(s1, N, M), where s1 is the type 1 population = 10, N is the sample size = 30, and M is the population size = 100
     

  • Model the rest as: xi = Hypergeometric(si, N - SUM(x1:xi-1), SUM(Di:Dn)), where si is the type i population, xi is the number of type i successes in the sample, N - SUM(x1:xi-1) is the part of the sample not yet allocated to types 1 to i-1, and SUM(Di:Dn) is the number of individuals of types i through n (the last type) still left in the population.
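
A minimal Python/NumPy sketch of these two steps (NumPy's hypergeometric generator is parameterised as ngood, nbad, nsample; the helper name is an illustrative assumption):

  import numpy as np

  rng = np.random.default_rng()

  def mv_hypergeometric_sample(D, n):
      """Sample x_1..x_k by drawing each type in turn from the shrinking population."""
      D = list(D)
      remaining_sample = n
      remaining_pop = sum(D)
      x = []
      for Di in D[:-1]:
          if remaining_sample > 0:
              # Type i successes among the part of the sample not yet allocated
              xi = rng.hypergeometric(Di, remaining_pop - Di, remaining_sample)
          else:
              xi = 0
          x.append(xi)
          remaining_sample -= xi
          remaining_pop -= Di
      x.append(remaining_sample)            # the last type gets whatever is left of the sample
      return x

  # 10 red, 15 purple, 20 blue, 25 green and 30 yellow balls; draw 30 without replacement
  print(mv_hypergeometric_sample([10, 15, 20, 25, 30], 30))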


The solution to this problem can be reached here: Multivariate Hypergeometric

The links to the software-specific Multivariate Hypergeometric models are provided here:



