If the observed data are continuous and reasonably extensive, it is often sufficient to use a cumulative frequency plot of the data points themselves (sometimes known as an ogive) to define the variable's probability distribution. The figure below illustrates an example with 18 data points.
The F(x) value assigned to each observation is the expected value of F(x) for a data point of that rank in a random sample from the distribution, i.e. F(xi) = i / (n + 1), where i is the rank of the observed data point (1 for the smallest) and n is the number of data points. An explanation for this formula is provided here. Determination of the empirical cumulative distribution proceeds as follows:
The minimum and maximum for the empirical distribution are subjectively determined based on the analyst's knowledge of the variable. For a continuous variable, these values will be outside the observed range of the data. The minimum and maximum values selected here are zero and 45.
The data points are ranked in ascending order between the minimum and maximum values.
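The cumulative probability F(xi) for each xi-value is calculated as follows: F(xi) = i / (n + 1), where i is the rank of xi in the ascending order and n is the number of data points.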
This formula maximises the chance of replicating the true distribution.
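The steps above can be sketched in Python with NumPy. This is only an illustrative sketch, not the NonParaCont1 model: the function names `empirical_cdf` and `sample_cumulative` and the five data values are made up for the example, and it assumes that the chosen minimum and maximum are given cumulative probabilities of 0 and 1 and that sampling interpolates linearly between the points. The minimum of zero and maximum of 45 are the values chosen in the text.

```python
import numpy as np

def empirical_cdf(data, minimum, maximum):
    """Return (x, F) arrays defining an empirical cumulative distribution."""
    x = np.sort(np.asarray(data, dtype=float))   # rank the data in ascending order
    if minimum >= x[0] or maximum <= x[-1]:
        raise ValueError("minimum and maximum must lie outside the observed range")
    n = x.size
    ranks = np.arange(1, n + 1)                  # i = 1 .. n
    f = ranks / (n + 1)                          # F(xi) = i / (n + 1)
    # Assumption: the subjective minimum and maximum take F = 0 and F = 1.
    xs = np.concatenate(([minimum], x, [maximum]))
    fs = np.concatenate(([0.0], f, [1.0]))
    return xs, fs

def sample_cumulative(xs, fs, size, rng):
    """Draw samples by inverting the piecewise-linear cumulative curve."""
    u = rng.random(size)                         # uniform(0, 1) probabilities
    return np.interp(u, fs, xs)                  # linear interpolation of the inverse

# Hypothetical observations, for illustration only.
observed = [12.0, 30.5, 7.2, 21.4, 16.8]
xs, fs = empirical_cdf(observed, minimum=0.0, maximum=45.0)
samples = sample_cumulative(xs, fs, size=1000, rng=np.random.default_rng(1))
```

With n = 5 observations the interior cumulative probabilities are 1/6, 2/6, ..., 5/6, so no observed point is assigned a probability of exactly 0 or 1, which leaves room for values between the data and the chosen minimum and maximum.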
If there is a very large amount of data, it becomes impractical to use all of the data points to define the Cumulative distribution. In such cases, it is useful to batch the data first. The number of batches should balance fineness of detail (a larger number of bars) against the practicalities of having large arrays defining the distribution (a smaller number of bars). In that case, you can use the Histogram distribution, as illustrated in the example below.
The model NonParaCont1 illustrates two examples: the first, where we use all the data to construct the Cumulative distribution; and the second, where the number of data values is large and the data are therefore binned by equal percentiles (using a Histogram distribution).
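The batching idea can also be sketched in Python with NumPy. This is a rough illustration and not a reproduction of the NonParaCont1 model: it assumes equal-width bars between the chosen minimum and maximum (not the equal-percentile binning used in the model), the function names `batch_to_histogram` and `sample_histogram` are hypothetical, and the data set is randomly generated purely for the example.

```python
import numpy as np

def batch_to_histogram(data, minimum, maximum, n_bars):
    """Batch data into equal-width bars; return bar edges and bar probabilities."""
    counts, edges = np.histogram(data, bins=n_bars, range=(minimum, maximum))
    probs = counts / counts.sum()                # relative frequency of each bar
    return edges, probs

def sample_histogram(edges, probs, size, rng):
    """Pick a bar according to its probability, then sample uniformly within it."""
    bars = rng.choice(len(probs), size=size, p=probs)
    return rng.uniform(edges[bars], edges[bars + 1])

rng = np.random.default_rng(1)
# Made-up "large" data set, for illustration only.
data = rng.gamma(shape=4.0, scale=4.0, size=10_000)

# Assumption: minimum of zero, maximum rounded up from the largest observation.
edges, probs = batch_to_histogram(data, 0.0, float(np.ceil(data.max())), n_bars=20)
samples = sample_histogram(edges, probs, size=1000, rng=rng)
```

Here 10,000 observations are reduced to 20 bar probabilities plus a minimum and maximum. Raising `n_bars` preserves more detail of the data at the cost of a longer array.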
The links to the software-specific NonParaCont1 models are provided here: