

Lightning strikes, car accidents, machine failures, political crises, disease outbreaks – are all random events in time that can be thought of as independent of each other. Daisies on a lawn, bacteria in a liquid, mold in a silo, diamonds in a rock – can all be thought of as random events in either two (surface) or three (volume) dimensional space.


The most common approach to modeling how many of these events, α, might occur in a given amount of time or space t is to assume that the events follow a Poisson process, in which case the counts follow a Poisson distribution:


Counts α = Poisson(λ*t)  


where λ is the mean (expected) number of events that would occur per unit t. Care needs to be taken with the units of λ and t to ensure that they match.  The product λ*t is the expected number of events over the period t and is sometimes called the Poisson intensity.


In a Poisson process, there is a continuous and constant opportunity for an event to occur. The Poisson process can be applied to approximate the number of events that occur in either time or space. For example, if there is a constant opportunity that a person receives an e-mail with a virus, then the number of viruses α that the person will receive within a year could be modeled as:


α = Poisson (λ * t )


where λ is the expected number of infected e-mails per period of time (one day, for example) and t is the number of periods (365 days in this case).


The same applies to the number of events in space. If bacteria were randomly distributed in a vat of liquid, and not dying or multiplying, the number of bacteria α consumed by drinking from that vat would follow a Poisson distribution, where the measure of exposure would be the amount of liquid consumed:


α = Poisson(λ * t), where λ is the expected number of bacteria per unit of space (one cm³, for example) and t is the number of cubic centimeters of liquid consumed by a person.


The simulation software package provides a Poisson distribution with just one parameter, the Poisson intensity, which it calls λ. We prefer to separate λ (the expected count per unit of exposure) and t (the amount of exposure) because it helps to avoid some common confusion over units, but the result is the same.


Excel offers a function POISSON( ) that calculates probabilities of the required Poisson distribution. This could cause confusion as the Excel function and the distribution have the same name, but they are easy to distinguish as their formats are quite different:


=POISSON(x, λ, 0/1_toggle)

=Poisson(λ)


The Poisson(λ) distribution in the simulation software package returns random numbers that are integers, while the POISSON(x, λ, 0/1_toggle) function in Excel returns a probability.


Two more useful formulae: from the probability mass equation for the Poisson distribution, we get:


Probability of zero counts = EXP(-λ*t) = POISSON(0,λ*t,0)

Probability >zero counts = 1- EXP(-λ*t) = 1-POISSON(0,λ*t,0)
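
As a rough illustration of the two formulae above, the following Python sketch evaluates the Poisson probability mass function directly. The function name poisson_pmf and the values chosen for λ and t are assumptions made for this example only; they do not come from the simulation software or from Excel.

```python
import math


def poisson_pmf(k, lam, t):
    """Probability of exactly k events when the expected count is lam * t."""
    mean = lam * t
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Illustrative values only (assumptions): 0.5 events per unit of exposure, 3 units.
lam, t = 0.5, 3.0

p_zero = poisson_pmf(0, lam, t)   # equals EXP(-lam*t)
p_at_least_one = 1.0 - p_zero     # equals 1 - EXP(-lam*t)

print(f"P(zero counts)  = {p_zero:.4f}")
print(f"P(>zero counts) = {p_at_least_one:.4f}")
```

Calling poisson_pmf(k, lam, t) plays the role of POISSON(k, λ*t, 0) in Excel: it returns a probability, not a random count.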



Examples of Poisson count modeling


Example 1


An insurance company has sold 25,000 car insurance policies for the next year. Last year, there were 0.046 claims/car insured/year, and this rate is expected to continue for next year. How many insurance claims will there be next year?


Ignoring our uncertainty about λ (see here), we would proceed as follows:


λ  = 0.046 claims/car insured/year

t = 25 000 insured car years


Claims = Poisson(25,000 * 0.046) = Poisson(1150)

This Poisson distribution is almost exactly a Normal(1150, √1150) (see Central Limit Theorem for an explanation). It is interesting to note that there is not much randomness about the 1150 value: the distribution varies between about 1,050 and 1,250. This is because a Poisson distribution has a standard deviation equal to the square root of its mean, so a Poisson(1150) distribution has a standard deviation of about 34: roughly 3% of its mean. If the insurance company had sold just a few policies (around 1000), so that it could expect about 50 claims in a year, the standard deviation would have been about 7, or about 14% of the mean. The stability that comes from large numbers of claims enables insurance companies to accurately predict their expenditure and therefore offer very competitive policies.
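
The arithmetic in this example can be checked with a short Python sketch. It uses only the λ and t values given above; the helper poisson_summary and the three-standard-deviation range are choices made for this illustration, relying on the Normal approximation mentioned in the text.

```python
import math


def poisson_summary(mean):
    """Return (mean, standard deviation, sd as a fraction of the mean)."""
    sd = math.sqrt(mean)  # a Poisson sd is the square root of its mean
    return mean, sd, sd / mean


claims_per_car_year = 0.046   # lambda, from the example
insured_car_years = 25_000    # t, from the example

large_mean, large_sd, large_rel = poisson_summary(
    claims_per_car_year * insured_car_years
)

# Normal approximation: almost all outcomes lie within 3 sd of the mean.
low, high = large_mean - 3 * large_sd, large_mean + 3 * large_sd

# A small portfolio expecting about 50 claims, as discussed in the text.
small_mean, small_sd, small_rel = poisson_summary(50)

print(f"large portfolio: mean {large_mean:.0f}, sd {large_sd:.1f} ({large_rel:.1%})")
print(f"approximate range: {low:.0f} to {high:.0f}")
print(f"small portfolio: mean {small_mean:.0f}, sd {small_sd:.1f} ({small_rel:.1%})")
```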


Example 2

In a factory, it is estimated that there will be about 0.03 injuries/person/1000 hours worked. There are 20 machines, each manned by one person. The machines run 12 hours/day, 250 days per year. How many injuries will there be in the next year?


λ = 0.03 injuries/person/1000 hours worked = 0.00003 injuries/person hour worked

t = 20*12*250 = 60000 person hours of work


Injuries = Poisson(0.00003*60000) = Poisson(1.8)

The probability that there will be no injuries next year = EXP(-1.8) ≈ 17%, as seen in the graph above (the height of the first column), and thus the probability of at least one injury in the year is about 83%.
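
A minimal Python sketch of this calculation, using the λ and t values stated above (the variable names are illustrative):

```python
import math

# lambda: 0.03 injuries per person per 1000 hours, converted to per person-hour
injuries_per_person_hour = 0.03 / 1000

# t: 20 people * 12 hours/day * 250 days/year
person_hours = 20 * 12 * 250

expected_injuries = injuries_per_person_hour * person_hours  # Poisson mean
p_no_injuries = math.exp(-expected_injuries)                 # P(0 events)
p_at_least_one = 1.0 - p_no_injuries

print(f"expected injuries      = {expected_injuries:.2f}")
print(f"P(no injuries)         = {p_no_injuries:.1%}")
print(f"P(at least one injury) = {p_at_least_one:.1%}")
```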



Extension to the model to account for seasonality


The rate at which things occur in time is often seasonal. For example, car accidents occur more often during rush hour (a daily seasonality), at the beginning of summer holidays (a yearly seasonality), from Monday to Friday when people work (a weekly seasonality), etc. If the time being considered means that any seasonality will be averaged out, we can ignore it. For example, if we wanted to estimate insurance claims next year, all daily, weekly and yearly seasonality are averaged out in a yearly estimate of λ. However, if we wanted to forecast for just one month, it would be important to know which month: a winter month, for instance, may have more accidents due to snow or ice.


There is a probability identity that says that Poisson(a) + Poisson(b) = Poisson(a+b) when the two Poisson variables are independent. So, if we calculate λs for, say, n periods each of length t, the counts for the total n*t period can be modeled as:


=Poisson(λ₁t + λ₂t + … + λₙt) = Poisson(t * (λ₁ + λ₂ + … + λₙ))


In other words, we simply have to sum the λs for each period to get a total λ for the entire n*t period and use one Poisson distribution to model the total counts.
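
The identity can be checked numerically. The sketch below compares the distribution of the sum of two independent Poisson counts (computed by convolving their probability mass functions) with a single Poisson whose mean is the sum of the two means. The seasonal rates are hypothetical numbers chosen for illustration, not values from the text.

```python
import math


def poisson_pmf(k, mean):
    """P(count = k) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def pmf_of_sum(k, mean_a, mean_b):
    """P(A + B = k) for independent A ~ Poisson(mean_a), B ~ Poisson(mean_b)."""
    return sum(
        poisson_pmf(j, mean_a) * poisson_pmf(k - j, mean_b)
        for j in range(k + 1)
    )


# Hypothetical seasonal rates (assumptions): expected events per month
# for a winter month and a summer month.
lambda_winter, lambda_summer = 4.0, 2.5
t = 1.0  # each period is one month long

max_diff = max(
    abs(
        pmf_of_sum(k, lambda_winter * t, lambda_summer * t)
        - poisson_pmf(k, (lambda_winter + lambda_summer) * t)
    )
    for k in range(30)
)
print(f"largest pmf difference over k = 0..29: {max_diff:.2e}")
```

The difference is at the level of floating-point rounding, which is what the identity predicts.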


Extension to the model to account for "over-dispersion"


Sometimes historic observations in equal-length periods have a wider distribution than a fitted Poisson distribution would suggest. If that "over-dispersion" (i.e. the data have a greater spread than the fitted distribution can account for) cannot be explained by seasonal variations in the Poisson intensity, then it could be that the Poisson intensity is itself varying randomly. A rather neat result is that, if one models the random variation of λ as a Gamma(0,β,α) distribution, the resultant mixture of Poisson distributions is a Negative Binomial distribution. So:


=Poisson(Gamma(0,β,α)) is the same as = NegBinomial(1/(1+β),α) - α


This is very useful because a Gamma distribution can take a wide variety of right-skewed shapes: from an Exponential, through a Lognormal type of shape, to a Normal distribution, so the NegBinomial gives us quite some flexibility.
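
The following Python sketch simulates the Gamma-Poisson mixture and compares its sample mean and variance with those of the corresponding Negative Binomial. It assumes that Gamma(0,β,α) means location 0, scale β and shape α, which is the reading consistent with the NegBinomial(1/(1+β),α) expression above. The values of α and β, the seed, and the simple Poisson sampler (Knuth's multiplication method, suitable only for small means) are choices made for this illustration.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    limit = math.exp(-mean)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return k
        k += 1


rng = random.Random(42)
alpha, beta = 2.0, 1.5   # hypothetical Gamma shape and scale
n_samples = 50_000

# Each count has its own randomly drawn Poisson intensity.
counts = [
    poisson_sample(rng.gammavariate(alpha, beta), rng)
    for _ in range(n_samples)
]

sample_mean = sum(counts) / n_samples
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (n_samples - 1)

# Negative Binomial (failures before the alpha-th success, p = 1/(1+beta)):
nb_mean = alpha * beta
nb_var = alpha * beta * (1.0 + beta)

print(f"simulated mean {sample_mean:.3f} vs theoretical {nb_mean:.3f}")
print(f"simulated var  {sample_var:.3f} vs theoretical {nb_var:.3f}")
```

The variance exceeds the mean, which is the over-dispersion a plain Poisson (variance equal to mean) cannot reproduce.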



Example 3: A Poisson variable mixed with another distribution.


Clumps of cysts are randomly distributed in water with an average of 0.3 clumps per litre. Each clump can contain from 1 to 5 cysts, with each value having the same probability of occurring. Each cyst has a 20% probability of infecting a person. If I drink 12 litres of water, how many cysts will I consume? What is the probability of my becoming infected from drinking 12 litres of water?

Model Cysts in water shows the solution to the problem.


The links to the software-specific Cysts in water models are provided here:

  Cysts_in_water


Here is a screenshot of the model:



Column J (cells J12:J24) returns the number of cysts for each clump. Our answer to the first question will then be a summation of all values in column J (displayed in cell F16). Notice that the number of clumps in the figure above is 5 (cell I8), so only the first five cells of column J have positive values; the rest of the cells have the value of zero.

  Cysts_in_water


Here is a screenshot of the model:



Column I (cells I13:I25) returns the number of cysts for each clump. Our answer to the first question will then be a summation of all values in column I (displayed in cell F17). Notice that the number of clumps in the figure above is 5, so only the first five cells of the column I have positive values; the rest of the cells have the value of zero.


The number of clumps I will consume by drinking 12 litres of water is simulated in this model. We use the Poisson distribution with λ = 0.3 and t = 12. To calculate the total number of cysts that I will consume, we need to model the number of cysts for each clump separately. A common mistake here is to model the number of cysts n as:


n = Poisson(λ * t) * Round(Uniform(0.5,5.5))


This is wrong because the equation says that all clumps have the same number of cysts. This will result in excessive spread of the estimate for n. The correct way to model this variable is to model each clump separately and then sum the required number of clumps.
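
To see the difference in spread, the sketch below simulates both approaches in Python. It is not the spreadsheet model itself: the function names, the seed, the number of trials and the simple Poisson sampler are assumptions made for this illustration.

```python
import math
import random
import statistics


def poisson_sample(mean, rng):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    limit = math.exp(-mean)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return k
        k += 1


lam, t = 0.3, 12   # clumps per litre, litres consumed (from the example)
trials = 20_000
rng = random.Random(1)


def cysts_correct(rng):
    """Each clump gets its own cyst count (1 to 5, equally likely)."""
    clumps = poisson_sample(lam * t, rng)
    return sum(rng.randint(1, 5) for _ in range(clumps))


def cysts_wrong(rng):
    """The mistaken model: every clump is given the same cyst count."""
    return poisson_sample(lam * t, rng) * rng.randint(1, 5)


correct = [cysts_correct(rng) for _ in range(trials)]
wrong = [cysts_wrong(rng) for _ in range(trials)]

print(f"correct: mean {statistics.mean(correct):.2f}, "
      f"sd {statistics.stdev(correct):.2f}")
print(f"wrong:   mean {statistics.mean(wrong):.2f}, "
      f"sd {statistics.stdev(wrong):.2f}")
```

Both versions have the same mean, but the mistaken version has a noticeably larger standard deviation.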


To calculate the probability of my getting ill, we proceed as follows. The probability that a single cyst will infect me is 20%, so the probability that it will not infect me is 80%. Assuming the cysts act independently, the probability that none of n cysts infects me is (80%)^n, and the probability that at least one of n cysts infects me is 1 - (80%)^n. The mean of this formula over the simulated values of n is then the answer to our second question. This is another example of numerical integration.
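
A minimal Python sketch of this step is given below. It averages 1 - (80%)^n over simulated values of n, as described above. As a cross-check that is not part of the original model, it also evaluates a closed-form result that follows from the probability generating function of a Poisson sum. The seed, trial count and helper names are assumptions.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    limit = math.exp(-mean)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return k
        k += 1


lam, t = 0.3, 12            # clumps per litre, litres consumed
p_infect_per_cyst = 0.20    # from the example
trials = 20_000
rng = random.Random(7)

total = 0.0
for _ in range(trials):
    clumps = poisson_sample(lam * t, rng)
    n_cysts = sum(rng.randint(1, 5) for _ in range(clumps))
    # Probability that at least one of the n cysts infects me.
    total += 1.0 - (1.0 - p_infect_per_cyst) ** n_cysts

p_infected = total / trials

# Cross-check (an addition for this sketch): for a Poisson number of clumps,
# E[0.8^n] = exp(lam * t * (G - 1)), where G is the mean of 0.8^size per clump.
g = sum((1.0 - p_infect_per_cyst) ** size for size in range(1, 6)) / 5
p_infected_exact = 1.0 - math.exp(lam * t * (g - 1.0))

print(f"simulated P(infected)   = {p_infected:.4f}")
print(f"closed-form P(infected) = {p_infected_exact:.4f}")
```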



Example 4


In 79 weeks, a company has observed 21 transaction failures.

Each transaction failure could cost:


Cost (£) | Probability
70 | 70%
120 | 20%
190 | 10%


Question: What is the cost for these transaction failures next year?

Model Transaction failures shows the solution to this problem.

The links to the software-specific Transaction failures models are provided here:

  Transaction_failures


Here is a screenshot of the model:



Cell E10 uses a Gamma distribution to estimate the number of transaction failures per week (λ) in this model. See the Poisson process for an explanation.

Cells E11:E13 calculate the number of transaction failures for each transaction failure cost (£70, £120, £190). The output cell D15 simply multiplies the number of transaction failures of each type by the cost of the failure and sums the values across all three types.

  Transaction_failures


Here is a screenshot of the model:



Cell D10 uses a RiskGamma distribution to estimate the number of transaction failures per week (λ) in this model. See the Poisson process  for an explanation.

Cells D11:D13 calculate the number of transaction failures for each transaction failure cost (£70, £120, £190). The output cell D15 simply multiplies the number of transaction failures of each type by the cost of the failure and sums the values across all three types.
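
The logic of this model can be sketched in Python as follows. This is not a reproduction of the spreadsheet. It assumes that the uncertainty about λ is a Gamma distribution with shape 21 and scale 1/79 (the observed failures and weeks), that next year has 52 weeks, and that the number of failures of each cost type is an independent Poisson with mean λ * 52 * probability. All names, the seed and the trial count are hypothetical.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw one Poisson variate (Knuth's method; fine for small means)."""
    limit = math.exp(-mean)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return k
        k += 1


observed_failures, observed_weeks = 21, 79   # from the example
weeks_next_year = 52                         # assumption
cost_probabilities = {70: 0.70, 120: 0.20, 190: 0.10}  # cost in pounds


def simulate_annual_cost(rng):
    """One simulated total cost of transaction failures for next year."""
    # ASSUMPTION: uncertainty about the weekly rate is Gamma(shape=21, scale=1/79).
    lam = rng.gammavariate(observed_failures, 1.0 / observed_weeks)
    total = 0
    for cost, prob in cost_probabilities.items():
        # ASSUMPTION: each cost type is an independent Poisson count.
        failures = poisson_sample(lam * weeks_next_year * prob, rng)
        total += cost * failures
    return total


rng = random.Random(2024)
trials = 20_000
costs = [simulate_annual_cost(rng) for _ in range(trials)]
mean_cost = sum(costs) / trials

# Expected cost for comparison: rate * weeks * average cost per failure.
expected_cost = (observed_failures / observed_weeks) * weeks_next_year * sum(
    cost * prob for cost, prob in cost_probabilities.items()
)

print(f"simulated mean annual cost: {mean_cost:.0f}")
print(f"expected annual cost:       {expected_cost:.0f}")
```

The full list costs, not just its mean, is the answer to the question: it describes the range of total costs that could occur next year.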




