To learn more about EpiX Analytics' work, please visit our modeling applications, white papers, and training schedule.

Page tree


Generating your own distribution when you know the cdf, pdf or pmf


Situation:

You wish to use a parametric probability distribution that is not available in your software, and you know:


The cumulative distribution function (continuous variable);

The probability density function (continuous variable); or

The probability mass function (discrete variable).


This section describes the technique for each situation.



 

Known cumulative distribution function (cdf)

This method applies when you know the cdf of a continuous probability distribution. The algebraic equation of the cdf can often be inverted to make x the subject of the equation. For example, the cdf of the Exponential probability distribution is:



F(x)=1-e^{-x/\beta}

                                                                         (1)


Where b is the mean of the Exponential distribution. Inverting equation 1 to make x its subject gives:



x= -\beta ln(1-F(x))

                                                                         (2)


A random sample from any continuous probability distribution has equal probability of lying in any equally sized range of F(x) between zero and one. For example, the variable X has a 10% probability of producing a value between x1 and x2 (x2>x1) where F(x2)-F(x1) = 0.1. Looking at it the other way around, F(x) can be thought of as being a Uniform(0,1) random variable. Thus, we can write Equation 2 as a general formula to generate values from an Exponential distribution with a mean of 23 as follows:


= -23*LN(1-Uniform(0,1))                           (3)


or [can you see why?]:


= -23*LN(Uniform(0,1))                              (4)


This is exactly equivalent to writing:


=Exponential(1/23)


in a cell. [Try it if you like]. Since we have used one distribution to generate another, if we use  Latin Hypercube Sampling (LHS) to draw from the Uniform(0,1) distribution, we will get Latin Hypercube samples from the constructed Exponential distribution too. Similarly, if we force the Uniform distribution to sample from its lower end – a technique necessary for advanced sensitivity analysis and stress analysis for example, Equation 3 (but not Equation 4) will generate values from the lower end of the Exponential distribution.


Disadvantage

The method hinges on being able to perform the algebra that inverts the cdf. If that proves too difficult, it may be easier to construct the cdf with a Cumulative distribution.

Known probability density function (pdf)

We may sometimes start with the pdf for a variable, determine the cdf by integration, and then use the first method although this generally requires good maths skills. Another reason one might like to determine the cdf could be an interest in the probability of the variable being below, between or above certain values. Integration of the pdf will give the cdf which can give you these probabilities. The example below shows you how to determine a cdf, starting with a pdf, in this case of a sine curve distribution.


Imagine that you want to design a distribution that followed the shape of a sine curve from 0 to a, where a is an input to the distribution. This distribution shape is shown in Figure 1.

Figure 1: The sine curve distribution we wish to create in our example


The probability density function f(x) is given by:



f(x)=b \sin(\frac{\pi x}{a})


where b is a constant to be determined so that the area under the curve equals one, as required for a probability distribution.


The cdf F(x) is then (0<x<a):


               

F(x)=b\int\limits_0^x{sin(\pi x/a)dx}




=\Big[(ab/ \pi)\left\{-\cos(\pi x/a)\right\}\Big]_0^x


                      

=(ab/ \pi)(1-\cos(\pi x/a))

                

                                                                                 (1)


For the area under the curve to equal one, b must be determined such that F(a) = 1, i.e.:


               

F(a)=(ab/ \pi)(1-\cos(\pi))=2ab/ \pi=1



Therefore,


               

b=\pi /2a



and, from Equation 1, F(x) becomes:


               

F(x)=\frac{1}{2}(1-\cos(\pi x/a))



We now need to find the inverse function to F(x). So, rearranging the equation above for x:


               

x=(a/ \pi)\cos^{-1}(1-2.F(x))



To generate this distribution, we put a Uniform(0,1) distribution in cell A1 (say), the value for a in cell B1 (say) and, in the cell that generates values of x, we write:


                = B1/PI() * ACOS (1-2*A1)   (The Excel function ACOS(y) returns cos-1(y) )


If we use Latin Hypercube Sampling to generate the values from the Uniform(0,1) distribution, we will generate a smoothly distributed set of x-values that will replicate the desired distribution.



Known probability mass function (pmf)

If you know the pmf of a distribution, it is a simple matter (in principle) to create the distribution in your software. The techniques above are not applicable because a cdf for a discrete variable is just the sum of the discrete probabilities, and it is thus not possible to construct an inverse transformation.


The method requires that you use two columns of data in Excel:


The first column is a set of value that the variable might take, e.g. {0,1,...99,100}

The second column is a set of probabilities using the pmf calculated for each of these values {p(0),p(1),...p(99),p(100)}


These two arrays are then used to construct the required distribution using a discrete distribution, such as the one implemented in Crystal Ball's Discrete distribution.


Of course, this method can become cumbersome if the {x} array is very large. In which case:


  1. Make the {x} list with spacing larger than 1, e.g. {0,5,10,..495,500};

  2. Calculate the associated probabilities {p(0),p(5),...p(495),p(500)}

  3. Construct a General distribution.

  4. In the cell in which you need the final distribution, wrap a ROUND function around the cell containing the General distribution; =ROUND(A5,0) to return discrete value

  5. Note that using a minimum = -0.5 (x1-0.5) and a maximum 500.5 (xn=0.5) will allocate a more accurate probability to the end values.





  • No labels