In most of the situations considered so far, we knew precisely how many random variables we had to add together. However, a problem frequently arises in which the number of random variables being summed is itself a random variable. Some examples are:

 

  • The total purchases by the number of customers N that might enter a shop next week where we know the probability distribution of the purchase amount from a random customer;

  • The amount of lake water that might be drunk by campsite visitors N this summer where we know the probability distribution of the amount of lake water drunk by a random camper, and the resultant number of giardia cysts that might be consumed, where we know the concentration of giardia cysts in the lake water;

  • The number of bacteria in a vat of liquid egg produced from N eggs where we know the probability distribution of the number of bacteria in a random egg;

  • The cost of insurance claims to an insurer where it knows the expected number of claims it will receive in a period, and knows the probability distribution of the size of a random claim.

 

The models for obtaining this sum are adaptations of the models above for adding a fixed number of variables.

 

Example 1 shows two different ways that one could add up a small but random number of random variables:

 

A company insures airplanes. They crash at a rate of 0.23 crashes per month. Each crash costs $Lognormal(120,52) million.

Question 1: What is the distribution of cost to the company for the next five years?

Question 2: What is the distribution of the value of the liability if we discount it at the risk-free rate of 5%?

The solution to Question 1 is provided in the model Plane Crashes1.

 

The links to the Plane Crashes1 specific models are provided here:

  plane_crashes1

A screen shot from the model is shown below:

 

 

Cell D9 calculates the number of plane crashes in 5 years using a Poisson(λt) function. In this example λt = 0.23*60 = 13.8, meaning that on average there are 13.8 accidents per 5 years. Column C (C13:C43) calculates the cumulative cost of accidents, in other words, the cost of all accidents before and including the ith accident. Since the probability of having more than 30 accidents per 5 years is very small (see graph below), we have limited the number of crashes in our model to 30.

  plane_crashes1

A screen shot from the model is shown below:

 

 

Cell D9 calculates the number of plane crashes in 5 years using a RiskPoisson(λt) function. In this example λt = 0.23*60 = 13.8, meaning that on average there are 13.8 accidents per 5 years. Column C (C13:C43) calculates the cumulative cost of accidents, in other words, the cost of all accidents before and including the ith accident. Since the probability of having more than 30 accidents per 5 years is very small (see graph below), we have limited the number of crashes in our model to 30.


If either the number of years (t) or the mean number of accidents per unit of time (λ) is increased, the maximum number of crashes allowed for in the model must be increased as well.
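The choice of 30 as the cut-off can be checked numerically. The sketch below is not part of the spreadsheet model; it simply sums the Poisson(13.8) probabilities up to 30 and reports what is left over. The function names are illustrative.

```python
import math

def poisson_pmf(k, mean):
    """P(N = k) for N ~ Poisson(mean), evaluated in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def prob_more_than(limit, mean):
    """P(N > limit) for N ~ Poisson(mean)."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(limit + 1))

rate_per_month = 0.23
months = 60
mean_crashes = rate_per_month * months  # lambda * t = 13.8

tail = prob_more_than(30, mean_crashes)
print(f"P(more than 30 crashes in 5 years) = {tail:.2e}")
```

Re-running the check with a larger λ or t shows when the cut-off needs to be raised.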

 

The same spreadsheet also provides a more efficient method of solving the same problem (Variant 2) as it only generates values from random variables when they are needed.
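Outside a spreadsheet, the idea behind Variant 2 (draw a claim size only for each crash that actually occurs) might look like the minimal Python sketch below. It assumes that Lognormal(120,52) is parameterised by its mean and standard deviation, and it samples the Poisson count by counting unit-rate exponential arrivals; the function names are illustrative and are not taken from the Plane Crashes1 model.

```python
import math
import random

def lognormal_params(mean, sd):
    """Convert a lognormal's mean and sd to the mu, sigma of the underlying normal."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)

def poisson_sample(mean, rng):
    """Draw from Poisson(mean) by counting unit-rate exponential arrivals in [0, mean]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def total_cost(rng, rate_per_month=0.23, months=60, mean=120.0, sd=52.0):
    """One scenario of the total 5-year cost, in $ million."""
    mu, sigma = lognormal_params(mean, sd)
    n_crashes = poisson_sample(rate_per_month * months, rng)
    # Only n_crashes claim sizes are generated, none are wasted.
    return sum(rng.lognormvariate(mu, sigma) for _ in range(n_crashes))

rng = random.Random(1)  # fixed seed so the run is repeatable
costs = [total_cost(rng) for _ in range(20000)]
mean_cost = sum(costs) / len(costs)
print(f"mean total cost over 5 years: {mean_cost:.0f} $ million")
```

Because the loop runs once per simulated crash, no upper limit on the number of crashes has to be chosen in advance.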

Question 2 requires that one know the time at which each accident occurs. These times can be simulated using Exponential distributions for the time between successive accidents. The solution is shown in the spreadsheet Plane Crashes2.

 

The links to the Plane Crashes2 software specific models are provided here:

  plane_crashes2

A screen shot from the model is shown below:

 

 

In this example model, column D (D14:D44) calculates the time of the ith accident (in months), with month 0 at the beginning of the 5-year period. Column E (E14:E44) returns the cost of the ith accident (which follows a Lognormal(120,52) distribution) if the time in the corresponding cell in column F has not exceeded 60 months (our period of interest is 5 years), and returns 0 otherwise. Cells in column G (G14:G44) hold the discounted value of the figure in column F.

The outcome of the model is calculated by summing the discounted values in column G (cell D10).

  plane_crashes2

A screen shot from the model is shown below:

 

 

In this example model, column B (B14:B44) calculates the time of the ith accident (in months), with month 0 at the beginning of the 5-year period. Column C (C14:C44) returns the cost of the ith accident (which follows a Lognormal(120,52) distribution) if the time in the corresponding cell in column C has not exceeded 60 months (our period of interest is 5 years), and returns 0 otherwise. Cells in column E (E14:E44) hold the discounted value of the figure in column D.

The outcome of the model is calculated by summing the discounted values in column E (cell D10).
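The same logic can be sketched in Python as below. This is an illustration, not a transcription of the Plane Crashes2 spreadsheet: it assumes Lognormal(120,52) means a mean of 120 and a standard deviation of 52, and it assumes the 5% rate is compounded annually, so a cost incurred at month t is divided by 1.05^(t/12). The function names are illustrative.

```python
import math
import random

def lognormal_params(mean, sd):
    """Convert a lognormal's mean and sd to the mu, sigma of the underlying normal."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)

def discounted_liability(rng, rate_per_month=0.23, months=60.0,
                         mean=120.0, sd=52.0, annual_rate=0.05):
    """One scenario of the discounted 5-year liability, in $ million."""
    mu, sigma = lognormal_params(mean, sd)
    total = 0.0
    # Times between crashes are Exponential with mean 1 / rate_per_month.
    t = rng.expovariate(rate_per_month)
    while t <= months:
        cost = rng.lognormvariate(mu, sigma)
        # Assumed convention: annual compounding, time measured in months.
        total += cost / (1.0 + annual_rate) ** (t / 12.0)
        t += rng.expovariate(rate_per_month)
    return total

rng = random.Random(2)  # fixed seed so the run is repeatable
values = [discounted_liability(rng) for _ in range(20000)]
mean_value = sum(values) / len(values)
print(f"mean discounted liability: {mean_value:.0f} $ million")
```

Since crash times are generated one after another until month 60 is passed, the sketch needs no fixed number of rows, unlike the 31-row spreadsheet layout.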



Example 2 shows a different approach where one is adding Poisson random variables, each of fixed size.

 

Clumps of cysts appear in water from a particular lake at a rate of 2.3 clumps per 1000 m³. The number of cysts in a clump follows the empirical distribution shown in the table below. A filtration system is used that captures these clumps with an efficiency that depends on the clump size, as shown in the third column of the table below. If a town draws its drinking water from this lake, uses the filtration system, and consumes 7600 m³ of water per year, how many cysts will be consumed? If a cyst has a 25% probability of causing illness, how many people will get ill next year?

 

Cyst # in clump    Probability of cyst #    Filtration capture rate
1                  22%                      10.0%
2                  27%                      19.0%
3                  17%                      27.1%
4                  12%                      34.4%
5                  9%                       41.0%
6                  6%                       46.9%
7                  3%                       52.2%
8                  2%                       57.0%
9                  1%                       61.3%
10                 1%                       65.1%

 

The model Clumps of Cysts provides a solution to the above problem.
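One way to set up this calculation is sketched below in Python. It is an illustration built on stated assumptions rather than a copy of the Clumps of Cysts model: the 2.3 per 1000 m³ figure is read as a rate of clumps, the number of clumps of each size that pass the filter is taken to be Poisson with mean rate × volume × P(size) × (1 - capture rate), and each consumed cyst is assumed to cause at most one illness, independently of the others.

```python
import random

CLUMP_RATE_PER_M3 = 2.3 / 1000.0   # assumed to be a rate of clumps
VOLUME_M3 = 7600.0
P_ILL = 0.25
# clump size -> (probability of that size, filtration capture rate)
CLUMPS = {1: (0.22, 0.100), 2: (0.27, 0.190), 3: (0.17, 0.271),
          4: (0.12, 0.344), 5: (0.09, 0.410), 6: (0.06, 0.469),
          7: (0.03, 0.522), 8: (0.02, 0.570), 9: (0.01, 0.613),
          10: (0.01, 0.651)}

def poisson_sample(mean, rng):
    """Draw from Poisson(mean) by counting unit-rate exponential arrivals in [0, mean]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def cysts_consumed(rng):
    """Cysts passing the filter in one year: sum over sizes of size * Poisson count."""
    total = 0
    for size, (prob, capture) in CLUMPS.items():
        mean_passing = CLUMP_RATE_PER_M3 * VOLUME_M3 * prob * (1.0 - capture)
        total += size * poisson_sample(mean_passing, rng)
    return total

def illnesses(n_cysts, rng):
    """Binomial(n_cysts, P_ILL) drawn as a sum of independent Bernoulli trials."""
    return sum(rng.random() < P_ILL for _ in range(n_cysts))

rng = random.Random(3)  # fixed seed so the run is repeatable
cyst_runs = [cysts_consumed(rng) for _ in range(20000)]
ill_runs = [illnesses(n, rng) for n in cyst_runs]
mean_cysts = sum(cyst_runs) / len(cyst_runs)
mean_ill = sum(ill_runs) / len(ill_runs)
print(f"mean cysts consumed: {mean_cysts:.1f}, mean illnesses: {mean_ill:.1f}")
```

Each term in the sum is a Poisson count multiplied by a fixed clump size, which is the sense in which the variables being added are "of fixed size".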

The links to the Clumps of Cysts software specific models are provided here:

 

Adding together large random numbers of random variables

 

For large numbers of random variables, we can use the Central Limit Theorem (CLT). For example, suppose we think that there will be Poisson(270000) potential customers passing the front of a store in a year, and that there is a 3% probability that any one of them will enter the store. Assuming each passer-by decides whether to enter independently of all other passers-by, the number of people entering the store in a year will be Poisson(270000*3%). If there is a 10% probability that a customer in the store makes a purchase, and we again assume that each customer decides independently of the others, the number of purchasers will be Poisson(270000*3%*10%) = Poisson(810). Let's also suppose that we have empirical data on past purchase sizes that can be summarized in the following histogram plot:

 

 

Since the distribution of purchase size by customer is not too skewed, and the number of purchases we are adding together is large, we can use the Central Limit Theorem. The mean and standard deviation of the histogram plot are $12.71 and $7.27 respectively, so a model of the total sales receipts for the year can be built accordingly. The model can be reached here: Sales at the Store
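As a rough illustration of the approach, the Python sketch below draws the number of purchasers N from Poisson(810) and then, using the CLT, draws the total receipts from Normal(N × 12.71, √N × 7.27) instead of summing N individual purchases. It is a sketch under those assumptions and is not taken from the Sales at the Store model; the function names are illustrative.

```python
import math
import random

MEAN_PURCHASE = 12.71  # mean of the purchase-size histogram, $
SD_PURCHASE = 7.27     # standard deviation of the purchase-size histogram, $

def poisson_sample(mean, rng):
    """Draw from Poisson(mean) by counting unit-rate exponential arrivals in [0, mean]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def annual_receipts(rng, mean_purchasers=270000 * 0.03 * 0.10):
    """One scenario of total receipts: CLT applied to the sum of N purchases."""
    n = poisson_sample(mean_purchasers, rng)
    return rng.gauss(n * MEAN_PURCHASE, math.sqrt(n) * SD_PURCHASE)

rng = random.Random(5)  # fixed seed so the run is repeatable
receipts = [annual_receipts(rng) for _ in range(5000)]
mean_receipts = sum(receipts) / len(receipts)
print(f"mean annual receipts: ${mean_receipts:,.0f}")
```

Only two random draws are needed per scenario (the count and the Normal total), however large N turns out to be.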

The links to the Sales at the Store software specific models are provided here:

  Sales_at_the_store

 

 

A plot of the Poisson(810) distribution (see figure below) shows that the number of purchasers will in all probability be above about 720:
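The statement can also be checked directly by summing the Poisson(810) probabilities up to 720. The snippet below is a small check, not part of the Sales at the Store model, and the function name is illustrative.

```python
import math

def poisson_cdf(k, mean):
    """P(N <= k) for N ~ Poisson(mean); each term is evaluated in log space."""
    return sum(math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
               for i in range(k + 1))

prob_at_most_720 = poisson_cdf(720, 810.0)
print(f"P(N <= 720) = {prob_at_most_720:.2e}")
```

A very small result confirms that the CLT is almost always being applied to more than 720 purchases.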

 


 

 

Adding together intermediately large random numbers of random variables

 

We are sometimes faced with a distribution of the number of random variables which extends over a range where, at its low end, CLT is not sufficiently accurate, but becomes acceptably accurate at higher numbers of random variables being added.

In this situation, we construct a model that flips between adding up individual values and applying CLT, according to a chosen threshold for N. The model below offers an example:

 

Problem: You run a special forces division of your country's army. You need to recruit another 16 personnel. The selection process is very tough, and only 34% of people who have ever started the training have completed it. The selection process is also very expensive. It costs $10600 to start a person on the program and $3200 thereafter for each week of training. The training course lasts 13 weeks. A person who fails has equal probability of doing so at any point during the course. You have a budget of $2,200,000. What is the probability that this will be sufficient for your recruitment needs?

 

The solution to the above problem is provided in the following model: CLT Flip
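The flip itself can be sketched in Python as follows. This is an illustration under several assumptions, not the CLT Flip model: a recruit who fails is assumed to do so at a time uniformly distributed over the 13 weeks, the weekly cost is assumed to be charged pro rata, and the threshold of 30 failures for switching to the CLT is a hypothetical choice. All names are illustrative.

```python
import math
import random

START_COST = 10600.0
WEEKLY_COST = 3200.0
COURSE_WEEKS = 13.0
PASS_PROB = 0.34
NEEDED = 16
BUDGET = 2_200_000.0
CLT_THRESHOLD = 30  # hypothetical switch point, not from the source model

# Mean and sd of the cost of one failed recruit under the Uniform(0, 13) assumption.
FAIL_MEAN = START_COST + WEEKLY_COST * COURSE_WEEKS / 2.0
FAIL_SD = WEEKLY_COST * COURSE_WEEKS / math.sqrt(12.0)

def failures_before_target(rng):
    """Number of recruits who fail before NEEDED recruits have passed."""
    failures = passes = 0
    while passes < NEEDED:
        if rng.random() < PASS_PROB:
            passes += 1
        else:
            failures += 1
    return failures

def failure_cost(n_fail, rng):
    """Total cost of n_fail failed recruits: individual sum or CLT, by threshold."""
    if n_fail < CLT_THRESHOLD:
        return sum(START_COST + WEEKLY_COST * rng.uniform(0.0, COURSE_WEEKS)
                   for _ in range(n_fail))
    return rng.gauss(n_fail * FAIL_MEAN, math.sqrt(n_fail) * FAIL_SD)

def total_cost(rng):
    pass_cost = NEEDED * (START_COST + WEEKLY_COST * COURSE_WEEKS)
    return pass_cost + failure_cost(failures_before_target(rng), rng)

rng = random.Random(4)  # fixed seed so the run is repeatable
costs = [total_cost(rng) for _ in range(20000)]
mean_cost = sum(costs) / len(costs)
prob_within_budget = sum(c <= BUDGET for c in costs) / len(costs)
print(f"mean cost: ${mean_cost:,.0f}")
print(f"P(budget is sufficient) = {prob_within_budget:.3f}")
```

Lowering CLT_THRESHOLD trades accuracy at small numbers of failures for speed; raising it does the reverse.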

 

The links to the CLT Flip software specific models are provided here:

 

 

 

