You have a set of random, representative observations of a single model variable, for example the number of children in American families (we'll look at a joint distribution for two or more variables at the end of this section), and you have enough observations to feel that the range and approximate random pattern have been captured. You want to use the data to construct a distribution directly.
It is unnecessary to fit a parametric distribution to the data: since the data already capture the pattern, one can simply use the empirical distribution of the data. (If there are no physical or biological reasons why a certain distribution should be used, we generally prefer an empirical distribution.) The main thing to keep in mind when using an empirical distribution is that extrapolating beyond the observed data can be difficult and subjective. Below, we outline three options for using the data to construct an empirical distribution:
1. Discrete Uniform: uses only the observed values, each with equal probability;
2. Cumulative: creates a cumulative distribution, and therefore allows values between those observed, and possibly values beyond the observed range;
3. Histogram: similar to a cumulative distribution, but can be more efficient when you have huge amounts of data.
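As a rough illustration of the simplest of these options, the sketch below resamples the observations themselves with equal probability, which is one way to read a Discrete Uniform distribution over the data. The function name `sample_discrete_uniform` and the `children` values are made up for the example and are not taken from any particular tool.

```python
import random


def sample_discrete_uniform(observations, size, seed=None):
    """Draw `size` values, each observation being equally likely.

    Hypothetical helper: only observed values can ever be returned.
    """
    rng = random.Random(seed)
    return [rng.choice(observations) for _ in range(size)]


# Made-up illustrative data: number of children in ten families.
children = [0, 1, 2, 2, 3, 1, 0, 2, 4, 1]

draws = sample_discrete_uniform(children, size=1000, seed=1)
```

Values that appear several times in the data (here 1 and 2) are drawn proportionally more often, and no value outside the observed set can be generated.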
Option 1: A Discrete Uniform distribution
Our best guess of the cumulative probability of a data point in a set of observations turns out to be r/(n+1), where r is the rank of the data point within the data set and n is the number of observations. Thus, when choosing this option, one needs to:
- Rank the observations in ascending order;
- In the column to the left of the observations, calculate the rank of the data: write a column of values 1, 2, … n;
- In the column immediately to the right of the data, calculate the cumulative probability F(x) = rank/(n+1);
- Use the data and F(x) columns as inputs to the distribution.
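The steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation of any particular tool: the function names, the example data, and the convention that the modeller-chosen minimum and maximum receive cumulative probabilities 0 and 1 are all assumptions made for illustration. Sampling uses linear interpolation between adjacent points.

```python
import bisect
import random


def build_cumulative(observations, minimum, maximum):
    """Return (xs, ps): x values and their cumulative probabilities.

    Observed points get F(x) = rank / (n + 1). The chosen minimum and
    maximum are assumed to get F = 0 and F = 1 respectively.
    """
    data = sorted(observations)  # rank the observations in ascending order
    n = len(data)
    if not (minimum < data[0] and data[-1] < maximum):
        raise ValueError("minimum and maximum must bracket the data")
    xs = [minimum] + data + [maximum]
    ps = [0.0] + [rank / (n + 1) for rank in range(1, n + 1)] + [1.0]
    return xs, ps


def sample_cumulative(xs, ps, rng):
    """Draw one value by inverting the piecewise-linear F(x)."""
    u = rng.random()                 # uniform on [0, 1)
    i = bisect.bisect_right(ps, u)   # first index with ps[i] > u
    x0, x1 = xs[i - 1], xs[i]
    p0, p1 = ps[i - 1], ps[i]
    return x0 + (x1 - x0) * (u - p0) / (p1 - p0)


# Made-up illustrative data and bounds.
data = [3.2, 1.5, 4.8, 2.9, 3.7]
xs, ps = build_cumulative(data, minimum=1.0, maximum=6.0)

rng = random.Random(42)
samples = [sample_cumulative(xs, ps, rng) for _ in range(1000)]
```

Because the cumulative probabilities are strictly increasing, the interpolation never divides by zero, even when the data contain repeated values.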
Note that the minimum and maximum values of x affect only the very first and last interpolating lines of the Cumulative distribution, so the distribution becomes less and less sensitive to the values chosen as more data are used in its construction.
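To make this concrete, the sketch below compares the mean of the interpolated distribution for two different choices of maximum, first with few data points and then with many. It assumes, for illustration only, that the chosen minimum and maximum are assigned cumulative probabilities 0 and 1; the evenly spaced data and the function names are made up.

```python
def cumulative_mean(observations, minimum, maximum):
    """Mean of the piecewise-linear cumulative distribution."""
    data = sorted(observations)
    n = len(data)
    xs = [minimum] + data + [maximum]
    ps = [0.0] + [rank / (n + 1) for rank in range(1, n + 1)] + [1.0]
    # Within each segment the density is flat, so the segment mean
    # is the midpoint, weighted by the segment's probability.
    return sum(
        (ps[i + 1] - ps[i]) * (xs[i] + xs[i + 1]) / 2
        for i in range(len(xs) - 1)
    )


def shift_in_mean(n):
    """Change in the mean when the maximum moves from 12 to 20."""
    # Made-up data: n evenly spaced points strictly between 0 and 10.
    data = [10 * k / (n + 1) for k in range(1, n + 1)]
    return cumulative_mean(data, 0.0, 20.0) - cumulative_mean(data, 0.0, 12.0)


small = shift_in_mean(9)     # last segment carries probability 1/10
large = shift_in_mean(999)   # last segment carries probability 1/1000
```

Only the last segment changes, and it carries probability 1/(n+1), so the shift in the mean here is 4/(n+1): moving the maximum by the same amount has a hundred times less effect with 999 observations than with 9.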