Introduction to Bayesian inference concepts

Bayesian inference is based on Bayes' Theorem, the logic of which was first proposed in Bayes (1763). Bayes' Theorem states:

P(A_i|B)=\frac{P(B|A_i).P(A_i)}{\sum_{j} P(B|A_j).P(A_j)} \quad \text{(1)}

This formula is not very intuitive, so if you are unfamiliar with the subject you may want to work through the example below first.

Imagine that you are a doctor. A patient comes in with a particularly unusual set of symptoms. About 70% of people with those symptoms have disease A, 20% disease B and 10% disease C. So, your best guess is obviously that this person has disease A. We could say that your prior belief looks like this:

The vertical scale represents how confident you are about the true state of the patient. The disease the patient is suffering from is not a random variable, so the vertical axis is not probability.

Being a doctor, you have a simple test you can perform. You take a sample of saliva, add it to a tube with some chemical, and look for any change in color. It says on the box this test comes in that the test will turn black with different probabilities:

60% for a person with disease A;

90% for a person with disease B; and

100% for a person with disease C.

The test turns black… so what is your belief now about the patient? An event tree plot of what could have happened would help your thinking:

The probability that a person came in with disease A and gave a black test result = 0.7*0.6 = 0.42;

For disease B it is 0.2*0.9 = 0.18; and

For disease C it is 0.1*1.0 = 0.1.

One of these three scenarios must have occurred, so you weight the three according to these probabilities:

Confidence it is disease A: 0.42/(0.42+0.18+0.1) = 0.6

Confidence it is disease B: 0.18/(0.42+0.18+0.1) ≈ 0.257

Confidence it is disease C: 0.1/(0.42+0.18+0.1) ≈ 0.143

Now your belief looks like this:

It shows that the test has not changed your belief very much. Even though, for example, the test had a 100% probability of giving the observed result for disease C, you are still reasonably certain that disease C is not the cause. You will probably treat the patient for disease A.

We have just performed the Bayesian inference calculation of Equation 1 introduced earlier.

The equation is answering the question: How confident are we about state Ai given we have seen B? In this problem, Ai are the states of the patient (A1,A2,A3 = disease A, B, C respectively), and B is what we observed (the black test result). So:

P(A1) = 0.7

P(A2) = 0.2

P(A3) = 0.1

P(B|A1) = 0.6

P(B|A2) = 0.9

P(B|A3) = 1.0
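
The calculation above can be reproduced in a few lines of code. The following is a minimal sketch in Python; the function name `posterior_from` and the dictionary layout are illustrative choices, not part of the original example.

```python
# Prior confidence in each disease, and the probability of a black test
# result for a patient with that disease (values taken from the example).
prior = {"A": 0.7, "B": 0.2, "C": 0.1}
likelihood = {"A": 0.6, "B": 0.9, "C": 1.0}  # P(black | disease)


def posterior_from(prior, likelihood):
    """Apply Bayes' Theorem over a discrete set of states."""
    # Numerator of Bayes' Theorem for each state: P(B|Ai).P(Ai)
    joint = {state: prior[state] * likelihood[state] for state in prior}
    # Denominator: the sum over all states, which normalizes the result
    total = sum(joint.values())
    return {state: value / total for state, value in joint.items()}


posterior = posterior_from(prior, likelihood)
for disease, confidence in posterior.items():
    print(f"Disease {disease}: {confidence:.3f}")
```

Running this gives 0.600, 0.257 and 0.143 for diseases A, B and C, matching the weighted scenarios in the event tree.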

Notation for Bayesian inference

Bayesian inference is about learning from evidence. We start with prior knowledge (which can range from non-existent to very strong), and after observing data we update our knowledge of the parameter of interest:

θ is the parameter or parameters that we are trying to estimate.

x is the data or evidence that we have observed.

π(θ) is the prior distribution or simply the prior. This is a distribution of our knowledge about θ before we have observed the data x. In other words, it's what we knew before collecting new information. 

l(x|θ) is the likelihood function. This is a mathematical representation of the probability of observing the data x given a value of θ under a certain stochastic process. Put in simple terms, the likelihood states: "If I knew the value of θ and the stochastic process that generated x, what would be the likelihood of actually observing the data x?" This can be confusing, as the goal is often to estimate the value of an unknown θ, but it will become clearer as you read this section.

f(θ|x) is the posterior distribution or the posterior. The posterior represents our current knowledge of θ, incorporating both what we knew before, π(θ), and the current evidence, l(x|θ). The goal of a Bayesian analysis is to estimate this posterior distribution.

Bayesian inference is about shapes

The basic equation of Bayesian inference is:

f(\theta |X)=\frac{\pi (\theta ).l(X|\theta)}{\int \pi (\theta).l(X|\theta).d\theta } \quad \text{when } \theta \text{ is continuous}
f(\theta |X)=\frac{\pi (\theta ).l(X|\theta)}{\sum \pi (\theta).l(X|\theta)} \quad \text{when } \theta \text{ is discrete}

The denominators in these equations are normalizing constants that give the posterior distribution a total confidence of one. Since the denominator is simply a scalar value and not a function of θ, one can rewrite the equations in a form that is generally more convenient:


f(\theta |X)\propto \pi (\theta).l(X|\theta)
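
One way to see the proportional form at work is to evaluate prior times likelihood on a grid of θ values and normalize at the end. The sketch below is illustrative only: the data (7 successes in 10 trials), the flat prior, the grid resolution and all function names are assumptions made for this example.

```python
def grid_posterior(thetas, prior, likelihood):
    """Return posterior weights on a grid: prior * likelihood, normalized to sum to 1."""
    unnormalized = [prior(t) * likelihood(t) for t in thetas]
    total = sum(unnormalized)  # plays the role of the normalizing denominator
    return [u / total for u in unnormalized]


# Hypothetical data: 7 successes in 10 independent trials; theta is the
# unknown success probability.
successes, trials = 7, 10
thetas = [i / 200 for i in range(1, 200)]  # grid over (0, 1)


def flat_prior(theta):
    return 1.0  # assumed prior: every value of theta equally credible


def likelihood_shape(theta):
    # Binomial likelihood without its constant factor (the binomial coefficient)
    return theta ** successes * (1 - theta) ** (trials - successes)


def full_likelihood(theta):
    return 120 * likelihood_shape(theta)  # 120 is "10 choose 7"


posterior = grid_posterior(thetas, flat_prior, likelihood_shape)
posterior_full = grid_posterior(thetas, flat_prior, full_likelihood)

# The constant factor cancels in the normalization, so both posteriors agree.
max_diff = max(abs(a - b) for a, b in zip(posterior, posterior_full))
posterior_mean = sum(t * p for t, p in zip(thetas, posterior))
posterior_mode = thetas[posterior.index(max(posterior))]
print(f"mean = {posterior_mean:.3f}, mode = {posterior_mode:.3f}, max diff = {max_diff:.1e}")
```

Because any factor that does not depend on θ cancels when the weights are normalized, only the shape of the prior and of the likelihood matters.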


The shape of the prior distribution embodies the amount of knowledge we have about the parameter to start with. The more informed we are, the more focused the prior distribution will be:

Example 1: Comparison of the shapes of relatively more and less informed priors

The shape of the likelihood function embodies the amount of information contained in the data. If the information it contains is small, the likelihood function will be broadly distributed, whereas if the information it contains is large, the likelihood function will be tightly focused around some particular value of the parameter:

Example 2: Comparison of the shapes of likelihood functions for two data sets. The data set with the greatest information has a much greater focus.

But the amount of information contained in the data can only be measured by how much it changes what you believe. If someone tells you something you already know, you haven't learned anything, but if another person was told the same information, they might have learned a lot. Keeping to our graphical review, the flatter the likelihood function relative to the prior, the smaller the amount of information the data contains:

Example 3: The likelihood is flat relative to the prior so has little effect on the level of knowledge (the prior and posterior are very similar)

Example 4: The likelihood is highly peaked relative to the prior so has a great influence on the level of knowledge (the likelihood and posterior have very similar shapes)
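
The contrast between a flat and a peaked likelihood can also be checked numerically. The sketch below uses assumed, purely illustrative shapes: a normal-shaped prior centred at 5 with standard deviation 1, and two normal-shaped likelihoods centred at 6 with standard deviations 5 (flat) and 0.2 (peaked). None of these values come from the figures.

```python
import math


def normal_shape(x, mu, sigma):
    """Unnormalized normal curve; only its shape matters here."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def mean_and_sd(grid, weights):
    """Mean and standard deviation of a distribution given as grid weights."""
    total = sum(weights)
    mean = sum(x * w for x, w in zip(grid, weights)) / total
    var = sum((x - mean) ** 2 * w for x, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


thetas = [i / 100 for i in range(-1000, 2001)]  # grid from -10 to 20
prior = [normal_shape(t, 5.0, 1.0) for t in thetas]  # assumed prior


def posterior_summary(like_mu, like_sigma):
    # Posterior shape is prior * likelihood, evaluated point by point
    post = [p * normal_shape(t, like_mu, like_sigma) for t, p in zip(thetas, prior)]
    return mean_and_sd(thetas, post)


prior_mean, prior_sd = mean_and_sd(thetas, prior)
flat_mean, flat_sd = posterior_summary(6.0, 5.0)      # flat relative to the prior
peaked_mean, peaked_sd = posterior_summary(6.0, 0.2)  # peaked relative to the prior

print(f"prior:            mean {prior_mean:.2f}, sd {prior_sd:.2f}")
print(f"flat likelihood:  mean {flat_mean:.2f}, sd {flat_sd:.2f}")
print(f"peaked likelihood: mean {peaked_mean:.2f}, sd {peaked_sd:.2f}")
```

With the flat likelihood the posterior stays close to the prior, while with the peaked likelihood the posterior moves to, and takes roughly the width of, the likelihood.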

The closer the shape of the likelihood function to the prior distribution, the smaller the amount of knowledge the data contains and so the posterior distribution will not change greatly from the prior. In other words, one would not have learned very much from the data:

Example 5: Prior and likelihood have similar shapes (i.e. they agree) so the posterior distribution differs little from the prior.

On the other hand, if the focus of the likelihood function is very different from the prior we will have learned a lot from the data:

Example 6: The likelihood is highly peaked relative to the prior so has a great influence on the level of knowledge (the likelihood and posterior have very similar shapes)

That we learn a lot from a set of data does not necessarily mean that we are more confident about the parameter value afterwards. If the prior and likelihood strongly conflict, it is quite possible that our posterior distribution is broader than our prior. Conversely, if the likelihood leans towards an extreme of the possible range for the parameter, we can have a likelihood that has a significantly different emphasis from the prior, yet we get a posterior distribution that is narrower than the prior distribution:

Example 7: The likelihood is highly peaked relative to the prior and focused on one extreme of the prior's range, so is in reasonable disagreement with the prior, yet the posterior is strongly focused despite the disagreement because the parameter cannot be negative and is therefore constrained at zero.
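
A numerical sketch of this kind of situation, under assumptions chosen only for illustration: the parameter is a Poisson rate (which cannot be negative), the prior has a gamma shape with mean 4, and the hypothetical data are zero events observed over 3 units of exposure, so the likelihood is greatest at the lower bound of zero. This is not the model behind the figure.

```python
import math


def mean_and_sd(grid, weights):
    """Mean and standard deviation of a distribution given as grid weights."""
    total = sum(weights)
    mean = sum(x * w for x, w in zip(grid, weights)) / total
    var = sum((x - mean) ** 2 * w for x, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


# Grid of candidate rates from 0 to 30; the rate is constrained at zero.
rates = [i / 100 for i in range(0, 3001)]

# Assumed prior shape: gamma with shape 4 and rate 1 (mean 4, sd 2).
prior = [r ** 3 * math.exp(-r) for r in rates]

# Hypothetical data: zero events in 3 units of exposure.
# The Poisson probability of zero events is exp(-exposure * rate).
exposure = 3.0
likelihood = [math.exp(-exposure * r) for r in rates]

posterior = [p * l for p, l in zip(prior, likelihood)]

prior_mean, prior_sd = mean_and_sd(rates, prior)
post_mean, post_sd = mean_and_sd(rates, posterior)
print(f"prior:     mean {prior_mean:.2f}, sd {prior_sd:.2f}")
print(f"posterior: mean {post_mean:.2f}, sd {post_sd:.2f}")
```

The likelihood favours values well below the bulk of the prior, yet the posterior is narrower than the prior because it is squeezed against the lower bound.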
