To learn more about EpiX Analytics' work, please visit our modeling applications, white papers, and training schedule.

Page tree



In risk analysis, we often attempt to predict the results of a set of random trials, where the trials can result in either a "success' (the outcome we are most interested in) or a "failure'. For example:


  • How many airplane flights will result in crashes;

  • How many people eating hamburgers will get E.coli infections;

  • How many sales pitches will result in a sale;

  • How many people entering a shop will make a purchase;

  • How many women receiving fertility treatment will become pregnant;

  • How many cars will need a replacement engine within the guarantee period; etc.


Independent trials


If each of these n trials is independent (meaning the result of each trial is not influenced by the result of any previous trial), and if all trials have the same probability of success, the outcome of these trials conforms to a binomial process. Moreover, we can model the number of successes s using a Binomial distribution:


s = Binomial(p,n)


A binomial process is a random counting system where there are n independent identical trials, each one of which has the same probability of success p, which produces s successes from those n trials (where 0≤s≤n and n > 0 obviously). There are thus three parameters {n, p, s} that between them completely describe a binomial process.

To model the number of successes in a certain number of trials we will use:


s = Binomial(p,n)


where n is the number of trials and p is the probability of the trial becoming a success.


The simplest example of a binomial process is the toss of a coin. If I toss a fair coin (a coin with 50% probability of returning either head or tail) 10 times, what is the distribution of the number of heads I will get? The answer to this question can be modeled using just one formula:


s = Binomial(0.5,10) , which will produce the following outcome distribution:



As expected, the most likely value and the mean number of successes equal 5.


Problem: What is the probability that I will have no heads at all?


As we see from the graph above, this probability is very low, so is better not determined by simulation as we would need very many iterations to get an accurate answer. However, the probability could be calculated using Excel's BINOMDIST function:


P(s=0) = BINOMDIST(0,10,0.5,0) = 0.000976563


Problem: What is the distribution of the maximum number of heads I can get in a row by tossing a fair coin 10 times?


The solution to this problem is provided in the model: Coins 


The links to the Coins software specific models are provided here:

  coins












Column C (C8:C17) shows which of the tosses became successes (heads) using a Binomial distribution with the number of trials equal to 1. Column D (D8:D17) calculates the number of heads in a row by adding 1 to the running total if the corresponding cell in column C equals 1, or setting the running total to 0 otherwise. The outcome of the model is calculated in cell D19, which is just a maximum of the values in column D.

  coins












Column C (C8:C17) shows which of the tosses became successes (heads) using a RiskBinomial function with the number of trials equal to 1. Column D (D8:D17) calculates the number of heads in a row by adding 1 to the running total if the corresponding cell in column C equals 1, or setting the running total to 0 otherwise. The outcome of the model is calculated in cell D19, which is just a maximum of the values in column D.


Hypergeometric process


The hypergeometric process occurs when one is sampling randomly without replacement from some population, and where one is counting the number in that sample that have some particular characteristic. In this situation we have four parameters: M, the population size; D, the number of successes in the population; n, the sample size; and s, the number of successes in the sample. We do not have a probability of success parameter (p) here since it is changing because the proportion of successes in the remaining population is changing with every sample we take.


Let's imagine a pack of playing cards (52 cards, without jokers), and 13 of them are hearts. If we consider a card of hearts to be a success, then we have: M = 52 and D = 13. If we are to pick a single card from the pack at random, the probability of picking a heart is equal to 13/52 = 1/4.


What is the probability of picking two hearts in a row? Well, unlike in the binomial process, we cannot just multiply the probability of 1/4 by itself, since the probability of the second card being a heart depends on the suit of the first card:

1. if the first card was a heart (a success), then there are only 12 hearts remaining in the pack and the probability of picking another heart reduces to 12/51

2. if the first card was not a heart (a failure), then all 13 hearts are still in the pack and the probability of picking a heart increases to 13/51

Excel's HYPGEOMDIST(s,n,D,M) function calculates the probability of picking 2 hearts from two trials:


P(2 hearts) = HYPGEOMDIST(2,2,13,52) = 0.058823529


In general, direct calculation of a probability is to be preferred over simulation because it is faster and more accurate, however, probability problems quickly become far too complex for us to calculate and we then resort to simulation.


Let's now consider the following problem: I have three full packs of cards, and I draw 10 cards from each pack. What is the probability that I draw at least 10 hearts in total? This problem is more complicated as there are many combinations of samples from each pack that would give the required result, so we resort to simulation.

The solution to this problem is provided in the model: Cards.

The links to the Cards software specific models are provided here:

  cards









In the model, shown in the figure above, cells B8:D8 calculate the number of successes (hearts) in 10 trials for each pack. Cell F8 checks the total number of successes and returns 1 if the value is 10 or more and zero otherwise. Cell G8 uses Crystal Ball's CB.GetForeStatFN(x,2) function to calculate the mean of the distribution in cell F8. Since the cell F8 consists of either 0 or 1 values, the mean of this distribution is equivalent to the probability of cell F8 returning a value of 1.

  cards



In the model, shown in the figure above, cells B8:D8 calculate the number of successes (hearts) in 10 trials for each pack. Cell F8 checks the total number of successes and returns 1 if the value is 10 or more and zero otherwise. Cell G8 uses @RISK's RiskMean function to calculate the mean of the distribution in cell F8. Since the cell F8 consists of either 0 or 1 values, the mean of this distribution is equivalent to the probability of cell F8 returning a value of 1.



Mixed processes


Let's now look at an example where we have several stochastic processes working together.


20 people are invited to a party, where some wine, unknowingly contaminated with bacteria is being served. Let's say that there are on average 7 bacteria in one litre of wine, and one glass contains 150 ml of wine. 12 out of 20 people decided to drink one glass of wine each.

Let's also say that 7 people at the party are allergic to this particular bacteria, so if they consume one or more particles they have a higher chance of getting ill than those who are not allergic. The following table shows the probability of getting ill for both allergic and non-allergic people:


P(infection)

Bacteria consumed

Allergic

Not allergic

0

0

0

1

0.8

0.4

2+

1

0.6


So if for example one bacterium is consumed by someone allergic, then he has an 80% chance of getting infected and if he consumes  2 or more bacteria, he has a 100% probability of getting ill.

The problem is to work out how many people are going to get ill.

Spreadsheet Wine Problem shows a solution to this problem.


The links to the Wine Problem software specific models are provided here:

  wine_problem




















In this model the cell F16 is calculating the number of allergic people that drink a glass of wine. Since we have only 20 people at the party, we cannot model this variable as Binomial(7/20, 12), as the probability of the next person being allergic changes. Instead we must use the hypergeometric distribution: N(allergic) = Hypergeo (7/20, 12, 20)


There are two tables at the bottom part of the spreadsheet, which calculate the number of infected allergic and non-allergic people. These tables follow similar logic, with the only difference being that the left table uses the probabilities of infection for the allergic people, and the right table uses probabilities for non-allergic people. The first row for both tables are zeros since if there are 0 people drinking wine,  0 people will get infected.


Column "Dose" uses a Poisson distribution to calculate the number of bacteria in each person's glass. The Poisson intensity parameter l equals 7 * 150 / 1000. By using the Poisson distribution, we are assuming that there is a very large volume of wine to which the 7 bacteria/litre statistic applies.


Column "P(inf)" shows the probability of infection given the number of bacteria. It uses the VLOOKUP function to draw the needed values from the table "P(infection)" in the upper right corner of the spreadsheet.


Column "Infected sum" uses the binomial distribution with the number of trials set to 1 and the probability of success set to the corresponding cell in column "P(inf)" to model each individual's fate.


The output cell (K16) looks up the required number of infected and non-infected people in the two tables, and adds them together, thus producing the required answer.

  wine_problem



In this model the cell F16 is calculating the number of allergic people that drink a glass of wine. Since we have only 20 people at the party, we cannot model this variable as Binomial(12 , 7/20), as the probability of the next person being allergic changes. Instead we must use the hypergeometric distribution: N(allergic) = Hypergeo (12, 7, 20)


There are two tables at the bottom part of the spreadsheet, which calculate the number of infected allergic and non-allergic people. These tables follow similar logic, with the only difference being that the left table uses the probabilities of infection for the allergic people, and the right table uses probabilities for non-allergic people. The first row for both tables are zeros since if there are 0 people drinking wine,  0 people will get infected.


Column "Dose" uses a Poisson distribution to calculate the number of bacteria in each person's glass. The Poisson intensity parameter l equals 7 * 150 / 1000. By using the Poisson distribution, we are assuming that there is a very large volume of wine to which the 7 bacteria/litre statistic applies.


Column "P(inf)" shows the probability of infection given the number of bacteria. It uses the VLOOKUP function to draw the needed values from the table "P(infection)" in the upper right corner of the spreadsheet.


Column "Infected sum" uses the binomial distribution with the number of trials set to 1 and the probability of success set to the corresponding cell in column "P(inf)" to model each individual's fate.


The output cell (J16) looks up the required number of infected and non-infected people in the two tables, and adds them together, thus producing the required answer.



Clustered trials


Sometimes, the trials are not independent of each other but grouped together in fixed, or variable, sized groups, or clusters. For example, the number of airline passengers that might die in a year from a plane crash is strongly grouped because they are very likely to suffer the same fate if they are in an aircraft together that crashes. Other examples include:


  • How many people in a village get divorced (they are paired) in a year;

  • How many infected blood samples are mis-diagnosed by a laboratory (if, for example, a lab tests samples in batches and makes a mistake with a batch);

  • How many manufactured items in a consignment fail to meet the required tolerance (if, for example, a machine is not set up correctly for a production run).


In such situations, one can often model the group using a Binomial distribution. Then, if the number of individuals in the group is constant (say takes a value k), the number of successes is =k*Binomial(p,n). So, for example, if we had 290 married couples in the village, and believed a married couple had a 3% probability of divorcing in a year, we would estimate seeing 2*Binomial(3%,290) divorcees next year.


Alternatively, if the number in a group is variable, we need to create a model that sums a variable number of random variables. For example, imagine that we consider that there is a 5% chance of incorrectly setting up a machine to produce widgets resulting in a bad batch of out-of-tolerance widgets, and that we set up 10 machines on a day's production run, but that each machine will produce Poisson(250) widgets, what fraction of my production will be out of tolerance? Model Widgets provides the answer.

The links to the Widgets software specific models are provided here:


Probability randomly varies for the set of trials as a whole


If the probability of a trial becoming a success is a random variable itself, the resultant distribution of the number of successes is wider than a Binomial distribution. For example, the probability that the sheep in a flock will survive the winter depends on whether the winter is particularly harsh. All of the sheep will endure the same conditions, so if the probability is higher for one, it is higher for all. Therefore, you could argue that the probability is variable, but the same for each sheep. A convenient way of modeling a probability that varies is to use the Beta distribution. We could also model the number of successes as:


s = Binomial(Beta(a,b,1), n)


where a and b are two parameters used to create the required shape for the Beta distribution. In fact, this distribution for s is known in probability theory as the Beta-Binomial distribution. The Beta-Binomial distribution always has a greater spread than its most closely matching Binomial distribution.


But how about scenarios where the probability of a trial's success is a random variable but each trial's probability is independent of the others? You might think that we would need to model the probability for each trial separately as a random variable, for example as shown in the following spreadsheet: Indepprob.


But, in fact, we just need to use the following formula:


s = Binomial(a/(a+b), n)


where a/(a+b) is the mean of a Beta(a,b) distribution. The above model runs the calculation both ways for you to compare, but can you work out why this is true?


The links to the Indepprob software specific models are provided here:





  • No labels