**Constructing an empirical distribution from data**

#### Situation

You have a set of random and representative observations of a single model variable, for example the number of children in American families (we'll look at a joint distribution for two or more variables at the end of this section), and you have enough observations to feel that the range and the approximate random pattern have been captured. You want to use the data to construct a distribution directly.

#### Technique

Since the data already capture the pattern, one can simply use the empirical distribution of the data rather than fitting a parametric distribution to them. The main thing to keep in mind when using an empirical distribution is that extrapolating beyond the observed data can be difficult and subjective. Below, we outline three options for constructing an empirical distribution:

1. Discrete Uniform: uses only the list of observed values

2. Cumulative: creates a cumulative distribution, and therefore allows values between those observed, and possibly values beyond the observed range

3. Histogram: similar to a cumulative distribution, but can be more efficient with large datasets.

**Option 1: A Discrete Uniform distribution**

How to construct a discrete uniform distribution varies for different simulation software packages. Model Empirical Distributions provides an example.
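As a software-neutral illustration, the idea can be sketched in plain Python: every observed value is given the same probability 1/n, so a value that appears several times in the data is drawn proportionally more often. The function name `sample_discrete_uniform` and the data values are hypothetical and are not taken from any particular simulation package.

```python
import random

def sample_discrete_uniform(observations, size, seed=None):
    """Draw `size` values, giving each observed data point probability 1/n."""
    rng = random.Random(seed)
    data = list(observations)
    if not data:
        raise ValueError("at least one observation is required")
    # rng.choice picks one list element uniformly at random; duplicates in
    # the data therefore carry proportionally more weight.
    return [rng.choice(data) for _ in range(size)]

# Illustrative (made-up) observations of the number of children per family.
children = [0, 1, 2, 2, 3, 1, 0, 2, 4, 1]
draws = sample_discrete_uniform(children, size=1000, seed=1)
```

Only values that were actually observed can ever be generated, which is the defining limitation of this option.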

The links to the Discrete Uniform software specific models are provided here:

**Option 2: A Cumulative distribution**

If your data are continuous you also have the option of using a Cumulative distribution.

Our best guess of the cumulative probability of a data point in a set of observations turns out to be r/(n+1) where r is the rank of the data point within the data set and n is the number of observations. Thus, when choosing this option, one needs to:

1. Rank the observations in ascending order;

2. In the column to the left of the observations, calculate the rank of the data: write a column of values 1, 2, … n;

3. In the column immediately to the right of the data, calculate the cumulative probability F(x) = rank/(n+1);

4. Use the data and F(x) columns as inputs to the distribution.

Note that the minimum and maximum values of x that you specify for the Cumulative distribution only affect its very first and last interpolating lines, so the distribution becomes less and less sensitive to the values chosen as more data are used in its construction.
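The ranking procedure above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation of any particular package: the helper names `build_cumulative` and `sample_cumulative` are hypothetical, the data values are made up, and the sketch assumes straight-line interpolation between points and that the chosen minimum and maximum lie strictly outside the observed data.

```python
import bisect
import random

def build_cumulative(observations, minimum, maximum):
    """Return (xs, ps): points of a piecewise-linear CDF with F(x_r) = r/(n+1)."""
    data = sorted(observations)              # rank in ascending order
    n = len(data)
    if not (minimum < data[0] and data[-1] < maximum):
        raise ValueError("minimum and maximum must bracket the data")
    xs = [minimum] + data + [maximum]
    # Rank r = 1..n maps to cumulative probability r/(n+1); the chosen
    # minimum and maximum are pinned to probabilities 0 and 1.
    ps = [0.0] + [r / (n + 1) for r in range(1, n + 1)] + [1.0]
    return xs, ps

def sample_cumulative(xs, ps, rng):
    """Inverse-transform sampling with linear interpolation between points."""
    u = rng.random()                         # u is uniform on [0, 1)
    i = bisect.bisect_right(ps, u)           # so that ps[i-1] <= u < ps[i]
    frac = (u - ps[i - 1]) / (ps[i] - ps[i - 1])
    return xs[i - 1] + frac * (xs[i] - xs[i - 1])

# Illustrative (made-up) continuous observations and subjective bounds.
xs, ps = build_cumulative([12.1, 9.4, 15.8, 11.0, 13.3], minimum=8.0, maximum=18.0)
rng = random.Random(1)
draws = [sample_cumulative(xs, ps, rng) for _ in range(1000)]
```

With n observations, the segments that touch the minimum and maximum each carry probability 1/(n+1), which is why the choice of those bounds matters less as n grows.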

Model Empirical Distributions provides an example.

The links to the Cumulative Distribution software specific models are provided here:

**Option 3: A histogram distribution**

Sometimes (admittedly, not as often as we'd like) we have an enormous number of random observations from which we would like to construct a distribution (for example, the generated values from another simulation). The Discrete Uniform and Cumulative options described above start to get a bit slow at that point, and they model the variable in unnecessarily fine detail. A more practical approach in this case is to create a histogram of the data and use that instead.
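A rough sketch of the histogram approach in plain Python is shown below. It assumes equal-width bins and a uniform spread of values within each bin; both are illustrative choices, and the helper names `build_histogram` and `sample_histogram` are hypothetical rather than taken from any particular package.

```python
import random

def build_histogram(observations, bins):
    """Summarise the data as equal-width bin edges and per-bin counts."""
    data = list(observations)
    lo, hi = min(data), max(data)
    if hi <= lo:
        raise ValueError("data must contain at least two distinct values")
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # Values equal to the maximum are placed in the last bin.
        k = min(int((x - lo) / width), bins - 1)
        counts[k] += 1
    edges = [lo + k * width for k in range(bins + 1)]
    return edges, counts

def sample_histogram(edges, counts, rng):
    """Pick a bin in proportion to its count, then a uniform value inside it."""
    k = rng.choices(range(len(counts)), weights=counts)[0]
    return rng.uniform(edges[k], edges[k + 1])

# Stand-in for a large set of generated values from another simulation.
source = random.Random(1)
big_data = [source.gauss(100.0, 15.0) for _ in range(100_000)]
edges, counts = build_histogram(big_data, bins=50)

rng = random.Random(2)
draws = [sample_histogram(edges, counts, rng) for _ in range(1000)]
```

Once the histogram is built, the model only has to store the bin edges and counts (51 and 50 numbers here) instead of all 100,000 observations.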

Model Empirical Distributions provides an example.

The links to the Histogram Distribution software specific models are provided here:

**Creating an empirical joint distribution for two or more variables**

For data that are collected in sets (pairs, triplets, etc.), there may be correlation patterns inherent in the observations that we would like to maintain when fitting empirical distributions to the data. An example is data on people's weight and height, where there is clearly some relationship between the two.
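One simple way to keep such patterns intact, sketched here in plain Python, is to resample whole records rather than each variable separately: a row is chosen at random and all of its values are used together. This is an illustrative assumption about how a joint empirical distribution could be built, not necessarily how any given package does it; the function name `sample_joint` and the data are hypothetical.

```python
import random

def sample_joint(records, size, seed=None):
    """Draw whole records so that values observed together stay together."""
    rng = random.Random(seed)
    rows = [tuple(r) for r in records]
    if not rows:
        raise ValueError("at least one record is required")
    # Each draw selects one complete row with probability 1/n, which
    # preserves whatever relationship exists between the columns.
    return [rng.choice(rows) for _ in range(size)]

# Illustrative (made-up) pairs of (height in cm, weight in kg).
people = [(160, 55), (165, 62), (170, 68), (175, 74), (180, 82), (185, 90)]
draws = sample_joint(people, size=1000, seed=1)
heights = [h for h, _ in draws]
weights = [w for _, w in draws]
```

Sampling the height and weight columns independently would instead produce combinations, such as the tallest height with the lowest weight, that never occurred in the data.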

Model Empirical Distributions provides an example.

The links to the Empirical Joint Distribution software specific models are provided here: