The motivation
The slightly deeper understanding we want to offer here of what you are actually doing when performing a Monte Carlo simulation will allow you to expand and improve your use of the software and Excel greatly: to find shortcuts that improve a model's accuracy and efficiency; to perform many complex statistical analyzes in a simple and intuitive way; perform mathematical integrals (admittedly a less common task for most of us); and it will help you understand and explain risk analysis models much better.
The principle
Numerical integration encompasses a broad range of techniques that replace algebraic integration with simulation: the relative frequency of the result of some mathematical operation performed on randomly generated values is developed that would be the equivalent of performing the integral of some function. It's a horrible definition - the best explanation is through examples:
The most common use of a simulation software package, viewed from a numerical integration perspective
Comparison of two or more random variables or uncertain parameters
Using simulation to give us the confidence that one parameter is larger than another, for example, or the probability that one random variable will be greater than another.
Mixing calculation and simulation
You will frequently realize that some parts of a risk model could be calculated directly rather than simulated, while other parts are too complicated and one must therefore resort to simulation. Numerical integration allows us to combine the simulated and calculated parts of a model.
Area of a curve, volume of a 3-D body, etc.
Not perhaps a technique you will often need, but interesting nonetheless, and useful for some engineering and scientific applications. In mathematics, we often need to evaluate a definite integral (meaning an integral that gives a numeric rather than algebraic answer). It is a very simple matter to do this in the simulation software package provided we can write the function to be integrated.
Monte Carlo simulation
The Monte Carlo simulation modeling that one would normally carry out is an example of numerical integration. For example it is quite simple to calculate the probability distribution function of the sum of two independent distributions:
Let X be the first random variable with cumulative distribution function FX(x), and let Y be the second distribution with density g(y). Then the cumulative distribution function of the sum of X and Y, FX+Y, is given by:
F_{X+Y}(a)=\int_{-\infty}^\infty F_x(a-y)g(y)dy |
The sum of two independent distributions is sometimes known as the convolution of the distributions.
For all but the simplest combinations of distributions this integral is very complicated or even impossible to determine algebraically. Monte Carlo simulation replaces the integral with the relative frequency with which values are generated. So, for example, we create a model with two cells, each with a distribution representing variable X and Y, and a third cell that adds together the generated values for X and Y. The cumulative distribution function of the sum of X and Y, FX+Y, is replaced in simulation by the empirical cumulative distribution of the generated values in this third cell.
Comparison of two or more random variables or uncertain parameters
Monte Carlo simulation gives you an easy way to compare two or more comparable random variables, or two or more comparable uncertain parameters. The technique for comparing uncertain parameters is as follows:
1. Construct a distribution for each uncertain parameter. For example:
A1: Uncertainty distribution for parameter X
A2: Uncertainty distribution for parameter Y
A3: Uncertainty distribution for parameter Z
where, of course, X, Y and Z are comparable: i.e. they have the same units (e.g. km/h) and refer to the same characteristic (e.g. maximum car speed of model X,Y,Z)
2. Construct a logical function that makes the required comparison between the values, and returns a 1 if true, and a 0 otherwise.
For example, in a simulation software package that uses Excel, we might want to know which car is fastest, so we would write:
B1: =IF(A1=MAX(A1:A3),1,0) for car X
B2: =IF(A2=MAX(A1:A3),1,0) for car Y
B3: =IF(A3=MAX(A1:A3),1,0) for car Z
Or we might wish to know whether a car is at least 10% faster than the other two:
B1: =IF(A1>MAX(A2,A3)*1.1,1,0) for car X
B2: =IF(A2>MAX(A1,A3)*1.1,1,0) for car Y
B3: =IF(A3>MAX(A1,A2)*1.1,1,0) for car Z
3. Count the fraction of iterations in a simulation of the model that generate a 1 for this test:
C1: mean(B1,2) for car X
C2: mean(B2,2) for car Y
C3: mean(B3,2) for car Z
The values that appear in these cells after a simulation run are the confidence we have that each car meets the tested requirement (e.g. fastest, fastest by 10%, etc.)
Comparing two or more comparable random variables is performed in exactly the same way, except that the distributions in A1:A3 would be probability distributions, and the values in C1:C3 would be probabilities.
The direct relationship between this type of simulation model and an algebraic integration is discussed in two examples:
Mixing calculation and simulation
As a general rule, it is much better to be able to create a probability model that calculates, rather than simulates, the required probability or probability distribution. Calculation is preferable because the model answer is updated immediately if a parameter value changes (rather than requiring a re-simulation of the model) and more importantly within this context, it is far more efficient.
For example, imagine that you have determined that a raw egg has a 0.2% probability of being contaminated with Salmonella, that there is a 3.5% probability that a random egg would be consumed raw, a contaminated raw egg contains Poisson(1)(50) bacteria, and that the probability a person would be ill is (2) =1-Exp(-bacteria/5308)^-0.4059. You are asked to calculate the probability that a random egg will cause someone to become ill.
Given a rather large number of simplifying assumptions that we don't need to go into here, you could obtain this probability by simulation, calculation, or something in-between. Algebraically, the probability that a random egg would cause illness is:
p(ill)=\text{0.2%*3.5%} * \displaystyle\sum_{D-1}^{\infty} \bigg(\frac{e^{-500}500^D}{D!} \bigg[ 1- \frac{D}{5308}\bigg] ^{-0.4059}\bigg) |
This equation calculates (the probability that the egg is contaminated) * (the probability the egg is consumed raw) * (the probability of having D bacteria and then getting ill from D bacteria, summed over all D). The component
\frac{e^{-500}500^D}{D!} |
is the Poisson probability of a dose of exactly D bacteria.
modeling approaches
As an example, in the model Raw Egg, the dose for any random egg is simulated, and will for (1-0.2%*3.5%) = 99.993% of iterations produce a scenario where the egg is not infected and/or is not consumed raw. We already know that the probability of illness is then zero, so this is a very inefficient model.
The links to the Raw Egg software specific models are provided here:
In this model, only the Poisson(50) dose is simulated. The dose is then fed into the dose-response model to give a probability of illness given that simulated dose were consumed. The outcome cell therefore generates the different possible probability of illness P(ill|Dose) values, and the relative frequency with which these values occur is P(Dose). Thus the mean of this cell is:
\displaystyle\sum_{D-1}^{\infty} \bigg(\frac{e^{-500}500^D}{D!} \bigg[1- \frac{D}{5308}\bigg] ^{-0.4059}\bigg) |
Another example is provided in the Hyperparameter topic, where the effect of a Hyperparameter in a likelihood function of a Bayesian inference is integrated out.
If we consider the calculation model only, the model calculates the entire probability equation, giving P(ill) = 2.67122E-07 or about 1 in 4 million. In order to get a reasonably accurate estimate of this probability using simulation only, one would have to run at least 100,000,000 iterations!
Notes:
(1) A Poisson distribution would not be used in reality in this situation - a LogSomething distribution would be more reflective of the exponential type growth exhibited by bacteria
(2) This is a low virulence approximation to the beta_Poisson dose responce model. The parameter values are widely disputed, so treat the equation as just an example.
Area of a curve, volume of a 3-D body, etc.
As already mentioned, this use of numerical integration is not of great importance in risk analysis, but it is interesting and good practice for Monte Carlo modeling. The technique is described in detail in a separate topic.