The precision of a risk analysis relies very heavily on the appropriate use of probability distributions to accurately represent the uncertainty, randomness and variability of the problem. In our experience, inappropriate use of probability distributions has proven to be a very common failure of risk analysis models. It stems, in part, from an inadequate understanding of the theory behind probability distribution functions and, in part, from failing to appreciate the knock-on effects of using inappropriate distributions.
In this section we discuss five basic properties of distributions and how these properties should be used to select the distributions in your model. The five properties are:
- Discrete or continuous
- Bounded or unbounded
- Parametric or non-parametric
- Univariate or multivariate
- First or second order
Finally, we have put together a table with links for each distribution as it fits into each category.
Discrete and continuous distributions
The most basic distinguishing property between probability distributions is whether they are continuous or discrete. Nonetheless, the literature is plagued with examples where the discrete or continuous nature of a variable is overlooked when fitting data to a distribution.
A discrete distribution may take one of a set of identifiable values, each of which has a calculable probability of occurrence. Discrete distributions are used to model parameters like the number of bridges a roading scheme may need, the number of key personnel to be employed or the number of customers that will arrive at a service station in an hour. Clearly, variables such as these can only take specific values: one cannot build half a bridge, employ 2.7 people or serve 13.6 customers.
A continuous distribution is used to represent a continuous variable, i.e. a variable that can take any value within a defined range (domain). For example, the height of an adult English male picked at random will have a continuous distribution because the height of a person is essentially infinitely divisible. We could measure his height to the nearest centimeter, millimeter, tenth of a millimeter, etc. The scale can be repeatedly divided up generating more and more possible values.
Properties like time, mass and distance, that are infinitely divisible, are modeled using continuous distributions. In practice, we also use continuous distributions to model variables that are, in truth, discrete but where the gap between allowable values is insignificant: for example, project cost (which is discrete with steps of one penny, one cent, etc.), exchange rate (which is only quoted to a few significant figures), number of employees in a large organization, etc.
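As a minimal sketch of the distinction, the Python snippet below samples one discrete and one continuous variable, and then treats a variable that is discrete in truth (cost in whole cents) as continuous. All numbers, probabilities and function names are hypothetical and chosen only for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Discrete variable: number of bridges a scheme may need.
# The values and probabilities below are made up for illustration.
bridge_counts = [0, 1, 2, 3]
bridge_probs = [0.1, 0.4, 0.3, 0.2]

def sample_bridges():
    """Draw one value from the discrete distribution above."""
    return rng.choices(bridge_counts, weights=bridge_probs, k=1)[0]

def sample_height_cm():
    """Draw one value from a continuous distribution (hypothetical Normal(175, 7))."""
    return rng.gauss(175.0, 7.0)

def sample_project_cost():
    """Return (continuous draw, the same draw rounded to the nearest cent)."""
    cost = rng.gauss(1_000_000.0, 50_000.0)
    return cost, round(cost, 2)

bridges = [sample_bridges() for _ in range(1000)]
heights = [sample_height_cm() for _ in range(1000)]
cost, cost_in_cents = sample_project_cost()
```

The rounding step changes the cost by at most half a cent, which is why the continuous approximation is harmless at this scale.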
Bounded and unbounded distributions
A distribution that is confined to lie between two determined values is said to be bounded or truncated. A distribution that is unbounded theoretically extends from minus infinity to plus infinity. A distribution that is constrained at only one end is said to be partially bounded. Unbounded and partially bounded distributions may, at times, need to be constrained to remove the tail of the distribution so that nonsensical values are avoided. For example, using a Normal distribution to model sales volume opens up the chance of generating a negative value. If the probability of generating a negative value is significant, and we want to stick to using a Normal distribution, we must constrain the model in some way to prevent any negative sales volume figure from being generated. Generally it is not a good idea to impose artificial bounds on a parametric distribution, so proceed with caution.
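Whether the chance of a negative value is significant can be checked before deciding to constrain anything. The sketch below computes P(X < 0) for a Normal distribution from the standard Normal tail formula; the function name and the two parameter sets are hypothetical.

```python
import math

def prob_negative(mean, sd):
    """Probability that a Normal(mean, sd) variable falls below zero.

    Uses P(X < 0) = 0.5 * erfc(mean / (sd * sqrt(2))), which stays
    accurate far out in the tail.
    """
    return 0.5 * math.erfc(mean / (sd * math.sqrt(2.0)))

# Zero lies 10 standard deviations below the mean: negligible probability.
p_tight = prob_negative(100.0, 10.0)

# Zero lies about 1.67 standard deviations below the mean: a few percent.
p_wide = prob_negative(100.0, 60.0)
```

In the first case no constraint is needed in practice. In the second, a few percent of the samples would be negative, and the choice of distribution deserves a second look.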
Most risk analysis software provides truncation of its distributions. For example, Crystal Ball does this using the truncation grabbers or by typing the specific boundaries manually in the distribution window: entering bounds of 70 and 120 for a Normal(100,10)
produces a Normal(100,10) distribution constrained between 70 and 120. The same result can be achieved in @RISK using the RiskTruncate function: =RiskNormal(100,10,RiskTruncate(70,120))
One can also build logic into the model that rejects nonsensical values. For example, using the IF function: A2:=IF(A1<0,ERR(),A1) only allows values into cell A2 from cell A1 that are >=0 and produces an error in cell A2 otherwise. Crystal Ball eliminates the error values from its analysis of the simulation results.
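Outside a spreadsheet, the same two ideas (truncating a distribution and rejecting nonsensical values) can be combined in a simple rejection sampler. The sketch below is an illustration only and does not reproduce how Crystal Ball or @RISK implement truncation; the function name and the `max_tries` safeguard are assumptions.

```python
import random

def truncated_normal(rng, mean, sd, low, high, max_tries=10_000):
    """Draw from Normal(mean, sd) restricted to [low, high] by rejection.

    Values outside the bounds are discarded and a new value is drawn,
    which mirrors discarding error cells from the simulation results.
    """
    for _ in range(max_tries):
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x
    raise RuntimeError("bounds leave too little probability to sample from")

rng = random.Random(7)  # fixed seed so the sketch is repeatable
samples = [truncated_normal(rng, 100.0, 10.0, 70.0, 120.0) for _ in range(5000)]
sample_mean = sum(samples) / len(samples)
```

Because the upper bound cuts off more of the distribution than the lower bound, the mean of the truncated samples sits slightly below 100. Truncation changes the moments of the distribution, not just its range.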
Note that imposing arbitrary truncations is different from distribution censoring. If you are faced with the problem of needing to constrain the tail of a distribution to avoid unwanted values, it is worth questioning whether you are using the appropriate distribution in the first place.
You will notice from the table below that only one of the distributions is bounded on the right extreme: the MinimumExtreme distribution. If you need any of the other distributions to be right-bounded for some reason, you can simply invert a left-bounded distribution. For example: =-Weibull(0,5,2) produces a left-skewed (i.e. right-bounded) distribution with an unbounded minimum and a maximum of 0; =10-Gamma(0, 1.5, 2) produces a left-skewed distribution with an unbounded minimum and a maximum of 10, as shown in the figures below. Also, the model Fitting_ExtValue illustrates how to fit minimal data to an ExtremeValue distribution.
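The inversion trick can be sketched in Python with the standard library's Weibull and Gamma generators. This assumes the spreadsheet arguments above are ordered (location, scale, shape), which is an assumption about the add-in rather than something stated here. Note that `random.weibullvariate` takes (scale, shape) while `random.gammavariate` takes (shape, scale). The helper names are hypothetical.

```python
import random

rng = random.Random(2024)  # fixed seed so the sketch is repeatable

def right_bounded_weibull(scale, shape, maximum=0.0):
    """maximum minus a Weibull draw: unbounded below, bounded above at maximum."""
    return maximum - rng.weibullvariate(scale, shape)

def right_bounded_gamma(shape, scale, maximum=0.0):
    """maximum minus a Gamma draw: unbounded below, bounded above at maximum."""
    return maximum - rng.gammavariate(shape, scale)

def sample_skewness(values):
    """Moment-based sample skewness (third central moment over sd cubed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

weibull_draws = [right_bounded_weibull(5.0, 2.0) for _ in range(20000)]
gamma_draws = [right_bounded_gamma(2.0, 1.5, maximum=10.0) for _ in range(20000)]

weibull_skew = sample_skewness(weibull_draws)  # negative: left-skewed
gamma_skew = sample_skewness(gamma_draws)      # negative: left-skewed
```

Negating a right-skewed variable flips the sign of its skewness, and adding a constant only shifts the maximum.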
Parametric and non-parametric distributions
There is a very useful distinction to be made between model-based parametric and empirical non-parametric distributions. By 'model-based', we mean a distribution whose shape is born of the mathematics describing a conceptual probability model. By 'empirical' or 'non-parametric' we mean a distribution whose mathematics is defined by the shape that is required. For example, a Triangular distribution is defined by its minimum, mode and maximum values. The defining parameters are features of the graph shape.
Those distributions that fall under the 'empirical' or non-parametric class are intuitively easy to understand, extremely flexible and are therefore very useful. Model-based or parametric distributions require a greater knowledge of the underlying assumptions if they are to be used properly.
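A short sketch of the Triangular case makes the point concrete: the three inputs are read straight off the shape of the graph. The values below are hypothetical. Note that Python's `random.triangular` takes its arguments in the order (low, high, mode).

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# The three defining parameters are features of the graph shape.
minimum, mode, maximum = 10.0, 15.0, 30.0

draws = [rng.triangular(minimum, maximum, mode) for _ in range(20000)]

# The mean of a Triangular distribution is the average of its three parameters.
theoretical_mean = (minimum + mode + maximum) / 3.0
sample_mean = sum(draws) / len(draws)
```

No underlying probability model is needed to use the distribution, only a view on the lowest, most likely and highest values.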
Parametric distributions should only be selected if at least one of the following applies:
- The theory underpinning the distribution applies to the particular problem;
- It is generally accepted that a particular distribution has proven to be very accurate for modeling a specific variable without actually having any theory to support the observation;
- The distribution matches the observed data very well indeed; or
- One wishes to use a distribution that has a long tail extending beyond the observed minimum or maximum.
These issues are discussed in more detail in the optional module on fitting distributions to data.
Univariate and multivariate distributions
Univariate distributions describe a single parameter or variable and are used to model a parameter or variable that is not probabilistically linked to any other in the model. Multivariate distributions describe several parameters whose values are probabilistically linked in some way. In most cases, we create the probabilistic links via one of several correlation methods. However, there are a few specific multivariate distributions that have specific, very useful purposes and are therefore worth studying more.
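As one illustration of a probabilistic link, the sketch below generates pairs of standard Normal variables with a target correlation, by mixing two independent draws. This is a generic textbook construction chosen for brevity. It is not necessarily one of the correlation methods referred to above, and the function names are hypothetical.

```python
import math
import random

def correlated_normal_pair(rng, rho):
    """Return two standard Normal draws with correlation rho (-1 <= rho <= 1)."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    # Mixing z1 into the second variable creates the link between them.
    return z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(5)  # fixed seed so the sketch is repeatable
pairs = [correlated_normal_pair(rng, 0.8) for _ in range(10000)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]
observed_rho = pearson(xs, ys)  # close to the target of 0.8
```

Sampling the two variables independently instead would give an observed correlation near zero and could misstate the spread of any output that depends on both.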
First and second order distributions
A probability or inter-individual variability distribution for which the parameters are precisely known is called a first-order distribution. A probability or inter-individual variability distribution for which there is some uncertainty about the parameters is called a second-order distribution. Thus, for example, Normal(100,10) is a first-order distribution, whereas Normal(m,s) is a second-order distribution if m and s are estimated and thus themselves carry uncertainty distributions. You cannot have a second-order distribution of uncertainty, because uncertainty about uncertainty collapses down to a single distribution of uncertainty, just as a probability distribution of a probability distribution collapses down to a single probability distribution.
A plot of a first-order distribution is easy to understand. For example:
It's a bit more difficult to illustrate a second-order distribution. One needs to account for the parameter uncertainty, which is usually done by using a number of lines to reflect possible true distributions (sometimes called candy-floss or spaghetti plots):
The second-order cumulative plot is generally much clearer than its corresponding density plot.
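The curves behind a second-order cumulative plot can be produced with two nested steps: first draw a set of parameters from their uncertainty distributions, then evaluate the resulting first-order distribution. The sketch below computes the curves without plotting them. The uncertainty distributions chosen for m and s, the grid and the function name are all hypothetical.

```python
import random
from statistics import NormalDist

rng = random.Random(11)  # fixed seed so the sketch is repeatable

def candidate_distributions(n_candidates):
    """Draw one (m, s) pair per candidate 'true' distribution."""
    candidates = []
    for _ in range(n_candidates):
        m = rng.gauss(100.0, 3.0)     # hypothetical uncertainty about the mean
        s = rng.uniform(8.0, 12.0)    # hypothetical uncertainty about the sd
        candidates.append(NormalDist(mu=m, sigma=s))
    return candidates

# Points at which each cumulative curve is evaluated: 60, 62, ..., 140.
grid = [60.0 + 2.0 * i for i in range(41)]

# First-order case: parameters known exactly, so there is a single curve.
first_order_curve = [NormalDist(100.0, 10.0).cdf(x) for x in grid]

# Second-order case: one cumulative curve per candidate parameter pair.
spaghetti = [[dist.cdf(x) for x in grid] for dist in candidate_distributions(20)]
```

Plotting each row of `spaghetti` against `grid` gives the set of lines described above, and the spread between the lines reflects the parameter uncertainty.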
Table of distributions
The table below gives an overview of the various distributions described in ModelAssist, so that you can most easily focus on which ones might be most appropriate for your modeling needs. Follow the links for an in-depth explanation of each. We have used the most common name for each distribution. If you are interested in a particular distribution whose name does not appear here, try using the search facility because many distributions have several names, or are recognized as simply special cases of other, more common distributions.
Left and right bounded
Orange indicates a multivariate distribution
Italics indicate non-parametric distributions