# Introduction to Bayesian inference concepts

Bayesian inference is based on Bayes' Theorem, the logic of which was first proposed in Bayes (1763). Bayes' Theorem states:

P(A_{i} \mid B)=\frac{P(B \mid A_{i})\,P(A_{i})}{\displaystyle\sum_{j=1}^{n}P(B \mid A_{j})\,P(A_{j})}

(1)

This formula is not very intuitive, so if you are not very familiar with the subject you may want to first review the example below.
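As a quick numerical illustration, here is equation (1) applied to a classic diagnostic-testing setup; the prevalence, sensitivity and specificity figures below are invented for the sketch:

```python
# Two hypotheses A_i: "sick" or "healthy"; evidence B: a positive test.
# Assumed numbers: 1% prevalence, 95% sensitivity, 10% false-positive rate.
prior = {"sick": 0.01, "healthy": 0.99}        # P(A_i)
likelihood = {"sick": 0.95, "healthy": 0.10}   # P(B|A_i)

# Denominator of equation (1): total probability of a positive test.
evidence = sum(likelihood[a] * prior[a] for a in prior)

# Posterior P(A_i|B) for each hypothesis.
posterior = {a: likelihood[a] * prior[a] / evidence for a in prior}
print(posterior)  # P(sick|positive) is only about 0.09 despite the accurate test
```

Note how the low prior probability of being sick dominates: even a fairly accurate test yields a modest posterior probability.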

### Notation for Bayesian inference

Bayesian inference is about learning from evidence. We start with *prior* knowledge (which can range from non-existent to very strong), and after observing *data* we update our *current knowledge* of the parameter of interest:

θ is the parameter or parameters that we are trying to estimate.

*x* is the data or evidence that we have observed.

π(θ) is the *prior distribution* or simply the *prior*. This is a distribution of our knowledge about θ *before* we have observed the data *x*. In other words, it's what we knew before collecting new information.

*l*(x|θ) is the *likelihood function*. This is a mathematical representation of the probability of observing the data *x* given a value of θ under a certain stochastic process. Put simply, the likelihood answers the question: "*If I knew the value of θ and the stochastic process that generated x, what would be the probability of actually observing the data x?*" This can be confusing, as the goal is often to estimate the value of an unknown θ, but it will become clearer as you read this section.

*f*(θ|x) is the *posterior distribution* or simply the *posterior*. The posterior represents our current knowledge of θ, incorporating both what we knew before, π(θ), and the current evidence through *l*(x|θ). The goal of a Bayesian analysis is to estimate this posterior distribution.
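To make the notation concrete, here is a small conjugate sketch; the Beta prior, the coin-flip data and all numbers are assumptions for illustration. With a Beta prior π(θ) and a binomial likelihood *l*(x|θ), the posterior *f*(θ|x) is again a Beta distribution in closed form:

```python
# pi(theta) = Beta(2, 2): a mildly informed prior on a coin's P(heads).
# x = 7 heads in 10 tosses. The Beta prior is conjugate to the binomial
# likelihood, so the posterior f(theta|x) is Beta(2 + 7, 2 + 3) exactly.
a, b = 2, 2            # prior Beta parameters
heads, tails = 7, 3    # observed data x

a_post, b_post = a + heads, b + tails   # posterior Beta parameters

# Posterior mean E[theta|x] = a_post / (a_post + b_post)
print(a_post / (a_post + b_post))  # ≈ 0.643
```

The posterior mean sits between the prior mean (0.5) and the observed proportion (0.7), which is exactly the compromise between prior and data that the notation above describes.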

### Bayesian inference is about shapes

The basic equation of Bayesian inference is:

f(\theta \mid x)=\frac{\pi(\theta)\,l(x \mid \theta)}{\int \pi(\theta)\,l(x \mid \theta)\,d\theta} \quad \text{when } \theta \text{ is continuous}

(2)

f(\theta \mid x)=\frac{\pi(\theta)\,l(x \mid \theta)}{\displaystyle\sum_{\theta} \pi(\theta)\,l(x \mid \theta)} \quad \text{when } \theta \text{ is discrete}

(3)

The denominators in these equations are normalizing constants that give the posterior distribution a total probability of one. Since the denominator is simply a scalar value and not a function of θ, one can rewrite the equations in a form that is generally more convenient:

f(\theta \mid x)\propto \pi(\theta)\,l(x \mid \theta)

(4)
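Equation (4) is easy to apply numerically: evaluate π(θ)·*l*(x|θ) over a grid of candidate θ values, then normalise so the results sum to one. A minimal sketch, assuming coin-flip data (7 heads in 10 tosses) and a flat prior:

```python
import math

heads, tosses = 7, 10                 # assumed data: theta is P(heads)
grid = [i / 200 for i in range(201)]  # candidate theta values in [0, 1]

# pi(theta): a flat prior (every theta equally plausible beforehand)
prior = [1.0 for _ in grid]

# l(x|theta): binomial likelihood of 7 heads in 10 tosses at each theta
likelihood = [math.comb(tosses, heads) * t ** heads * (1 - t) ** (tosses - heads)
              for t in grid]

# f(theta|x) proportional to pi(theta) * l(x|theta); normalise to sum to 1
unnorm = [p * l for p, l in zip(prior, likelihood)]
z = sum(unnorm)
posterior = [u / z for u in unnorm]

# With a flat prior, the posterior peaks at the observed proportion 7/10
print(grid[posterior.index(max(posterior))])  # 0.7
```

This grid approach works for any one-dimensional prior and likelihood, which is why the proportional form of equation (4) is the one used in practice.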

The *shape* of the prior distribution embodies the amount of knowledge we have about the parameter to start with. The more informed we are, the more focused the prior distribution will be:

**Example 1**: Comparison of the shapes of relatively more and less informed priors

The *shape* of the likelihood function embodies the amount of *information* contained in the data. If the information it contains is small, the likelihood function will be broadly distributed, whereas if the information it contains is large, the likelihood function will be tightly focused around some particular value of the parameter:

**Example 2**: Comparison of the shapes of likelihood functions for two data sets. The data set with the greater information has a much tighter focus.
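One way to see this numerically is to compare how tightly a binomial likelihood concentrates for a small versus a large sample with the same observed proportion; the data here are assumed for illustration:

```python
def likelihood_width(heads, tosses):
    # Binomial likelihood (constant factor dropped) over a theta grid;
    # return the fraction of the grid where it exceeds half its peak value,
    # a rough measure of how broadly the likelihood is spread.
    grid = [i / 1000 for i in range(1, 1000)]
    lik = [t ** heads * (1 - t) ** (tosses - heads) for t in grid]
    peak = max(lik)
    return sum(1 for v in lik if v > peak / 2) / len(grid)

print(likelihood_width(7, 10))      # small sample: broad likelihood
print(likelihood_width(700, 1000))  # large sample: much narrower
```

Both data sets point to θ ≈ 0.7, but the larger sample carries far more information, so its likelihood is an order of magnitude more focused.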

But the amount of information contained in the data can only be measured by how much it changes what you believe. If someone tells you something you already know, you haven't learned anything, but if another person was told the same information, they might have learned a lot. Keeping to our graphical review, the flatter the likelihood function relative to the prior, the smaller the amount of information the data contains:

**Example 3**: The likelihood is flat relative to the prior, so it has little effect on the level of knowledge (the prior and posterior are very similar)

**Example 4**: The likelihood is highly peaked relative to the prior, so it has a great influence on the level of knowledge (the likelihood and posterior have very similar shapes)

The closer the shape of the likelihood function is to that of the prior distribution, the smaller the amount of *knowledge* the data contains, and so the posterior distribution will not differ greatly from the prior. In other words, one would not have learned very much from the data:

**Example 5**: Prior and likelihood have similar shapes (i.e. they agree), so the posterior distribution differs little from the prior.
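A small conjugate sketch of this situation (all numbers are assumed): an informed Beta(70, 30) prior centred on 0.7, updated with data (7 heads in 10 tosses) whose likelihood agrees with it, yields a posterior almost identical to the prior:

```python
import math

def beta_sd(a, b):
    # Standard deviation of a Beta(a, b) distribution
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

a, b = 70, 30          # informed prior Beta(70, 30), centred on 0.7
heads, tails = 7, 3    # new data that agree with the prior

print(a / (a + b), beta_sd(a, b))                    # prior mean and sd
print((a + heads) / (a + b + heads + tails),
      beta_sd(a + heads, b + tails))                 # posterior: almost identical
```

The posterior mean is unchanged at 0.7 and the spread barely tightens: the data told us little we did not already believe.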

On the other hand, if the focus of the likelihood function is very different from the prior we will have learned a lot from the data:

**Example 6**: The likelihood's focus is very different from the prior's, so the data have a great influence on the level of knowledge (the posterior changes considerably from the prior)

That we learn a lot from a set of data does not *necessarily* mean that we are more confident about the parameter value afterwards. If the prior and likelihood strongly conflict, it is quite possible that our posterior distribution is broader than our prior. Conversely, if the likelihood leans towards an extreme of the possible range for the parameter, we can have a likelihood with a significantly different emphasis from the prior, yet obtain a posterior distribution that is narrower than the prior distribution:

**Example 7**: The likelihood is highly peaked relative to the prior and focused at one extreme of the prior's range, so it disagrees considerably with the prior, yet the posterior is strongly focused despite the disagreement because the parameter cannot be negative and is therefore constrained at zero.