Picking the best model to represent your data is a common challenge in any statistical analysis, and the same is true for simulation modeling. Visually comparing candidate models is intuitive and recommended as a starting point, but plots don't provide an objective assessment of model fit, so we recommend also using statistical measures to compare models. The most common such statistics are:

  1. Information Criteria (IC), which are based on Maximum Likelihood Estimation (MLE). Common IC include the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
  2. Goodness of fit (GOF) statistics:
    1. The Chi-Squared statistic, which is only appropriate for fitting data to discrete distributions
    2. The Kolmogorov-Smirnov (K-S) statistic, which should generally only be used for fitting continuous distributions.
    3. The Anderson-Darling (A-D) statistic, which can be seen as a more sophisticated version of the K-S statistic.

Data can also be censored, in which case using goodness-of-fit statistics is more complex.

Information criteria are based on the maximized likelihood as well as the number of parameters used to fit the model to the data. ICs are appealing because they measure how likely the observed data are under the fitted distribution, while also favoring parsimony by penalizing models with more parameters. That is, everything else being equal, a model that has fewer parameters is preferable to a model that has many parameters.
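To make the trade-off between likelihood and parsimony concrete, here is a minimal sketch using the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where L is the maximized likelihood, k the number of fitted parameters and n the number of data points. The data values and the choice of Exponential and Normal candidates are made up for illustration; in both cases a lower value indicates the preferred model.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

def exponential_loglik(data):
    """Maximized log-likelihood of an Exponential fit (MLE rate = 1 / mean)."""
    rate = len(data) / sum(data)
    return sum(math.log(rate) - rate * x for x in data)

def normal_loglik(data):
    """Maximized log-likelihood of a Normal fit (MLE variance divides by n)."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in data)

# Hypothetical sample, invented for this example
data = [0.3, 1.2, 0.7, 2.5, 0.1, 0.9, 1.8, 0.4, 3.1, 0.6]
n = len(data)

candidates = [
    ("Exponential", exponential_loglik(data), 1),  # one parameter: rate
    ("Normal", normal_loglik(data), 2),            # two parameters: mean, variance
]
for name, loglik, k in candidates:
    print(f"{name}: AIC={aic(loglik, k):.2f}  BIC={bic(loglik, k, n):.2f}")
```

Because the penalty term grows with k, the two-parameter Normal has to achieve a noticeably higher likelihood than the one-parameter Exponential before it is ranked first.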

The GOF statistics are not easy to understand or interpret, as they don't provide a true measure of the probability that the data actually come from the fitted distribution. Instead, they provide a probability that random draws from the fitted distribution would have produced a goodness-of-fit statistic value at least as large as that calculated for the observed data. Not very intuitive, right?!
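One way to see what that probability means is to simulate it. The sketch below is an illustration under stated assumptions, not how any particular software package computes its values: it fits an Exponential distribution to a made-up sample, computes the K-S statistic, then repeatedly draws samples of the same size from the fitted distribution and refits each one. It reports the fraction of simulated statistics that are at least as large as the observed one (the conventional p-value; its complement is the fraction that are lower). All function names are hypothetical.

```python
import math
import random

def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical CDF and the fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Check the gap just after and just before each step of the empirical CDF
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def fit_exponential(data):
    """MLE of the Exponential rate parameter."""
    return len(data) / sum(data)

def exponential_cdf(rate):
    return lambda x: 1.0 - math.exp(-rate * x)

def simulated_p_value(data, n_sims=2000, seed=1):
    rng = random.Random(seed)
    rate = fit_exponential(data)
    observed = ks_statistic(data, exponential_cdf(rate))
    at_least_as_large = 0
    for _ in range(n_sims):
        sample = [rng.expovariate(rate) for _ in data]
        # Refit each simulated sample, since the parameter was estimated from the data
        sim_rate = fit_exponential(sample)
        if ks_statistic(sample, exponential_cdf(sim_rate)) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_sims

# Hypothetical sample, invented for this example
data = [0.3, 1.2, 0.7, 2.5, 0.1, 0.9, 1.8, 0.4, 3.1, 0.6]
observed, p_value = simulated_p_value(data)
print(f"K-S statistic = {observed:.3f}, simulated p-value = {p_value:.3f}")
```

Note that the result says nothing directly about the probability that the data came from an Exponential distribution; it only describes how unusual the observed statistic would be if they had.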

Which statistic should you use?

Most software will use GOF statistics for hypothesis testing (did the data come from this type of distribution?) and both GOF and IC for ranking fitted models (distributions, copulas, or time series). We recommend that you be very selective about the distributions you choose to fit to your data in the first place, so that ranking is less important. In addition, these statistics only test the degree of match between the MLE distribution and the data: they do not consider the effect of uncertainty about the parameters (and there may be a great deal of uncertainty). As a general rule, if you wish to pay some attention to GOF statistics:

  1. If the software reports ICs, then use those on a pre-selected set of candidate models that make sense.
  2. If your software does not calculate ICs:
    1. Use the Chi-Squared statistic for discrete distributions.
    2. Use the Anderson-Darling statistic for fitting continuous distributions, except for very exotic distributions, in which case the Kolmogorov-Smirnov statistic is better because it is distribution-independent.
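For readers whose software reports neither statistic, the two recommended GOF measures can be sketched with their textbook formulas: the Chi-Squared statistic sums (observed - expected)^2 / expected over bins, and the Anderson-Darling statistic is A^2 = -n - (1/n) * sum over i of (2i - 1) * [ln F(x_i) + ln(1 - F(x_(n+1-i)))] for the sorted data. The samples, the Poisson and Exponential candidates, and the binning (0, 1, 2, 3 or more) are assumptions made up for this example.

```python
import math

def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over the bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def anderson_darling_statistic(data, cdf):
    """A^2 for a continuous fitted CDF; weights the tails more than K-S does."""
    xs = sorted(data)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        # xs[i - 1] is the i-th smallest value, xs[n - i] the i-th largest
        total += (2 * i - 1) * (math.log(cdf(xs[i - 1]))
                                + math.log(1.0 - cdf(xs[n - i])))
    return -n - total / n

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Discrete case: hypothetical count data fitted with a Poisson (MLE mean)
counts = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 2, 1, 4, 1, 0, 2, 1, 3, 0, 1]
n = len(counts)
lam = sum(counts) / n
observed = [sum(1 for c in counts if c == k) for k in range(3)]
observed.append(sum(1 for c in counts if c >= 3))   # pooled "3 or more" bin
expected = [n * poisson_pmf(k, lam) for k in range(3)]
expected.append(n - sum(expected))                  # remaining probability mass
print(f"Chi-Squared = {chi_squared_statistic(observed, expected):.3f}")

# Continuous case: hypothetical data fitted with an Exponential (MLE rate)
data = [0.3, 1.2, 0.7, 2.5, 0.1, 0.9, 1.8, 0.4, 3.1, 0.6]
rate = len(data) / sum(data)
a2 = anderson_darling_statistic(data, lambda x: 1.0 - math.exp(-rate * x))
print(f"Anderson-Darling = {a2:.3f}")
```

Both statistics measure discrepancy, so smaller values indicate a closer match between the data and the fitted distribution.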

Finally, you may well choose to design your own statistic according to your own requirements. For example, you may only be concerned about achieving a close fit for large x-values, in which case you could modify the A-D statistic to measure only the difference between the data and the fitted distribution over the area of interest. Alternatively, you may only need a close fit at a couple of specific cumulative percentiles, in which case you could simply sum the absolute errors at those points.
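As an illustration of the second idea, the sketch below sums the absolute differences between the empirical and fitted quantiles at a few chosen cumulative percentiles. This is one possible reading of "absolute errors at those points"; the function names, the linear-interpolation rule for empirical quantiles, the sample data and the choice of the 90th and 95th percentiles are all assumptions made for this example.

```python
import math

def empirical_quantile(data, p):
    """Empirical quantile at cumulative probability p, by linear interpolation."""
    xs = sorted(data)
    pos = p * (len(xs) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] + frac * (xs[hi] - xs[lo])

def exponential_quantile(p, rate):
    """Inverse CDF of the Exponential distribution."""
    return -math.log(1.0 - p) / rate

def percentile_error(data, fitted_quantile, percentiles):
    """Sum of absolute errors between data and fit at the chosen percentiles."""
    return sum(abs(empirical_quantile(data, p) - fitted_quantile(p))
               for p in percentiles)

# Hypothetical sample, fitted with an Exponential (MLE rate = 1 / mean)
data = [0.3, 1.2, 0.7, 2.5, 0.1, 0.9, 1.8, 0.4, 3.1, 0.6]
rate = len(data) / sum(data)

# Only the upper tail matters in this made-up scenario
error = percentile_error(data, lambda p: exponential_quantile(p, rate), [0.90, 0.95])
print(f"Summed absolute error at the 90th and 95th percentiles = {error:.3f}")
```

A statistic like this has no standard critical values, so it is best used to compare candidate fits against each other rather than to test a hypothesis.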
