# Fitting a continuous non-parametric second-order distribution to data

When we do not have a great deal of data, a considerable amount of uncertainty will remain about an empirical distribution determined directly from the data. It would be very useful to have the flexibility of using an empirical distribution, i.e. not having to assume a parametric distribution, and also to be able to quantify the uncertainty about that distribution. The following Bayesian technique provides these requirements.

Consider a set of n data values {xj} drawn from a distribution, and ranked in ascending order {xi} so xi<xi+1. Data thus ranked are known as the order statistics of {x}. We can use these order statistics to construct a second order empirical cumulative distribution

Here are the mathematics behind the technique:

Individually, each of the values of {xj} may map as a U(0,1) onto the cumulative probability of the parent distribution F(x). We take a U(0,1) distribution as the prior distribution for the cumulative probability for any value of x. We can thus use a U(0,1) prior for Pi = F(xi) for the value of the ith observation. However, we have the additional information that of n values drawn randomly from this distribution, xi ranked ith, i.e. (i-1) of the data values are less than xi and (n-i) values are greater than xi. Using Bayes' Theorem and the Binomial Theorem, the posterior marginal distribution for Pi can readily be determined, remembering that Pi has a U(0,1) prior and therefore a prior probability density = 1:

which is simply the standard Beta distribution Beta(i,n-i+1):

$//$                           (1)

Equation 1 could actually be determined directly from the fact that the beta distribution is the conjugate to the binomial likelihood function and that a U(0,1) = Beta(1,1). The mean of the Beta(in-i+1) distribution equals i / (n+1): a formula that has been used to estimate the best-fitting first-order non-parametric cumulative distribution.

Since Pi+1>Pi, these Beta distributions are not independent so we need to determine the conditional distribution f(Pi+1|Pi), as follows. The joint distribution f(Pi,Pj) for any two PiPj is calculated using the Binomial Theorem in a similar manner to the numerator of the equation for f(Pi|xii=1,n), i.e.:

where Pj>Pi and remembering that the prior probability densities for Pi and Pj equal 1 since they have U(0,1) priors.

For j=i+1:

The conditional probability f(Pi+1|Pi) is thus given by:

where k is some constant. The corresponding cumulative distribution function F(Pi+1|Pi) is then given by:

F(Pi+1|Pi)=1 at Pi+1=1, so k = (n-i) and the formula reduces to:

$//$    (2)

Equations 1 and 2 provide us with the tools to construct a non-parametric second-order distribution for a continuous variable given a data set sampled from that distribution. The distribution for the cumulative probability P1 that maps onto the first order statistic X1 can be obtained from Equation 1 by setting i = 1:

$//$                                   (3)

The distribution for the cumulative probability P2 that maps onto the first order statistic X2 can be obtained from Equation 2. F(Pi+1|Pi), being a cumulative distribution function, is Uniform(0, 1) distributed. Thus, writing Ui+1 to represent a Uniform(0, 1) distribution in place of F(Pi+1|Pi), using the identity 1-U(0, 1) = U(0, 1), and rewriting for Pi+1, we obtain:

$//$                (4)

which gives:

etc..

Note that each of the U2, U3, …Un Uniform distributions are independent of each other.

Recapping from the proof, the formulae from Equations 3 and 4 are:

$//$                                            (3)

$//$                         (4)

where Pj represents the estimate of the cumulative distribution function at xj, = F(xj). These can be used as inputs to construct a Cumulative distribution, together with subjective estimates of the minimum and maximum values that the variable may take, which can also be assigned subjective distributions.

Model NonParaCont2 creates a second-order distribution using this technique.

The links to the NonParaCont2 software specific models are provided here:

Note that in Crystal Ball, as F(x) changes from iteration to iteration, the Custom Distribution therefore needs to be capable of dynamic referencing.

##### Using this technique in a second order model

Use Crystal Ball's 2-D simulation Tool, as also illustrated here:

• Select "Run" >> "Tools" >> 2D Simulation;

• In Step 1 of 3, select "Forecast" (Cell I14) as target;

• Click Next

• In Step 2 of 3, select the assumption "Distribution (Model)" as Variability (all other assumptions are uncertainty)

• Click Next

• If you run the outer (uncertainty) simulation for 10 trials and run inner (variability) simulation for 1000 trials, this will result in the following second order model:

In @Risk the inputs to RiskCumul can be formulated as: =RiskCumul(min, max, {xj}, {Pj})

##### Using this technique in a second order model

• The uncertainty distributions for F(x) are nominated as outputs;

• A smallish number of iterations are run;

• The resultant data are exported back to a spreadsheet;

• Those data are then used to perform multiple simulations (the ”outer loop”) of uncertainty using @RISK’s RiskSimtable function: the ”inner loop” of comes from the Cumulative distribution itself

##### Limitations

There are a few limitations to this technique. In using a Cumulative distribution, one is assuming a histogram style probability distribution function between each of the {x} values. When there are a large number of data points, this approximation becomes irrelevant. However, for small data sets the approximation will tend to accentuate the tails of the distribution: a result of the histogram "squaring-off" effect of using the Cumulative distribution. In other words, the variability will be slightly exaggerated. However, the squaring effect can be reduced, if required, by using some sort of smoothing algorithm and defining points between each observed value. In addition, for small data sets, 'the tails' contribution to the variability will often be more influenced by the subjective estimates of the minimum and maximum values: a fact one can view positively (one is recognizing the real uncertainty about a distribution's tail), and negatively (the smaller the data set, the more the technique relies on subjective assessment).

The fewer the data points, the wider the confidence intervals will become quite naturally and, in general, the more emphasis will be placed on the subjectively defined minimum and maximum values. Conversely, the more data points available, the less influence the minimum and maximum estimates will have on the estimated distribution. In any case, the values of the minimum and maximum only have influence on the width (and therefore height) of the end two histogram bars in the fitted distribution. The fact that the technique is non-parametric, i.e. that no statistical distribution with a particular cumulative distribution function is assumed to be underlying the data, allows the analyst a far greater degree of flexibility and objectivity than that afforded by fitting parametric distributions.

• No labels