# Introduction to Bayesian inference concepts

Bayesian inference is based on Bayes' Theorem, the logic of which was first proposed in Bayes (1763). Bayes' Theorem states:

$$P(A_{i}\mid B)=\frac{P(B\mid A_{i})\,P(A_{i})}{\displaystyle\sum_{i=1}^{n}P(B\mid A_{i})\,P(A_{i})} \qquad (1)$$

This formula is not very intuitive, so if you are unfamiliar with the subject you may want to review the example below first.
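As a concrete illustration of Equation (1), the sketch below applies Bayes' Theorem to a hypothetical diagnostic test; all of the numbers are invented for illustration, not taken from any real test:

```python
# Hypothetical test: 99% sensitivity, 95% specificity, 1% disease prevalence.
p_disease = 0.01                      # P(A1): prior probability of disease
p_healthy = 1 - p_disease             # P(A2): prior probability of no disease
p_pos_given_disease = 0.99            # P(B|A1): sensitivity
p_pos_given_healthy = 0.05            # P(B|A2): false-positive rate

# Denominator of Equation (1): total probability of a positive result
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# P(A1|B): probability of disease given a positive result
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)
```

Even with an accurate test, the posterior probability of disease is only about one in six, because the low prior (1% prevalence) weighs heavily in the numerator.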


### Bayesian inference is about shapes

The basic equation of Bayesian inference is:

$$f(\theta\mid X)=\frac{\pi(\theta)\,l(X\mid\theta)}{\int \pi(\theta)\,l(X\mid\theta)\,d\theta} \quad \text{when } \theta \text{ is continuous}$$

$$f(\theta\mid X)=\frac{\pi(\theta)\,l(X\mid\theta)}{\sum_{\theta} \pi(\theta)\,l(X\mid\theta)} \quad \text{when } \theta \text{ is discrete}$$

The denominators in these equations are normalizing constants that give the posterior distribution a total probability of one. Since the denominator is simply a scalar value and not a function of θ, one can rewrite the equations in a form that is generally more convenient:

$$f(\theta\mid X)\propto \pi(\theta)\,l(X\mid\theta)$$
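This proportionality is exactly how a posterior is computed numerically: multiply prior by likelihood point-by-point over a grid of θ values, then divide by the sum to recover the normalizing constant. The model below is purely illustrative (a Beta(2,2) prior updated with 7 successes in 10 binomial trials, chosen here only as an example):

```python
# Grid of theta values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]

prior = [t * (1 - t) for t in grid]             # Beta(2,2) shape, up to a constant
likelihood = [t**7 * (1 - t)**3 for t in grid]  # binomial likelihood, up to a constant

# pi(theta) * l(X|theta), then divide by the normalizing constant
unnormalised = [p * l for p, l in zip(prior, likelihood)]
total = sum(unnormalised)
posterior = [u / total for u in unnormalised]
```

Note that the constants dropped from the prior and likelihood never matter: they cancel when dividing by `total`, which is why the proportional form is sufficient. (Here the posterior is Beta(9,5), with mode 2/3.)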

**Example 1**: Comparison of the shapes of relatively more and less informed priors

**Example 2**: Comparison of the shapes of likelihood functions for two data sets. The data set with the greater information has a much sharper focus.

But the amount of information contained in the data can only be measured by how much it changes what you believe. If someone tells you something you already know, you haven't learned anything, but if another person was told the same information, they might have learned a lot. Keeping to our graphical review, the flatter the likelihood function relative to the prior, the smaller the amount of information the data contains:

**Example 3**: The likelihood is flat relative to the prior so has little effect on the level of knowledge (the prior and posterior are very similar)

**Example 4**: The likelihood is highly peaked relative to the prior so has a great influence on the level of knowledge (the likelihood and posterior have very similar shapes)
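Examples 3 and 4 can be mimicked with a Normal prior and a Normal likelihood, for which the posterior has a closed form: the posterior precision (one over variance) is the sum of the prior and likelihood precisions, and the posterior mean is the precision-weighted average of the two. All the numbers below are hypothetical:

```python
import math

def posterior_mean_sd(m0, s0, x, s):
    """Posterior of a Normal(m0, s0^2) prior after observing x ~ Normal(theta, s^2)."""
    prec = 1 / s0**2 + 1 / s**2                  # posterior precision
    mean = (m0 / s0**2 + x / s**2) / prec        # precision-weighted mean
    return mean, math.sqrt(1 / prec)

# Prior: Normal(10, 2^2); likelihood focused at 14 in both cases
flat = posterior_mean_sd(10.0, 2.0, 14.0, 20.0)   # very flat likelihood (sd 20)
peaked = posterior_mean_sd(10.0, 2.0, 14.0, 0.5)  # very peaked likelihood (sd 0.5)

# Flat likelihood: posterior barely moves from the prior (Example 3).
# Peaked likelihood: posterior sits almost on the likelihood's focus (Example 4).
```

With the flat likelihood the posterior mean stays near the prior's 10 and its spread is nearly the prior's; with the peaked likelihood the posterior mean moves close to 14 and becomes much narrower.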

The closer the shape of the likelihood function is to that of the prior distribution, the smaller the amount of new *knowledge* the data contain, and so the posterior distribution will not change greatly from the prior. In other words, one would not have learned very much from the data:

**Example 5**: Prior and likelihood have similar shapes (i.e. they agree), so the posterior distribution differs little from the prior.

On the other hand, if the focus of the likelihood function is very different from the prior we will have learned a lot from the data:

**Example 6**: The likelihood's focus is very different from the prior's, so we learn a great deal from the data (the posterior is very different from the prior)

That we learn a lot from a set of data does not *necessarily* mean that we are more confident about the parameter value afterwards. If the prior and likelihood strongly conflict, it is quite possible that our posterior distribution is broader than our prior. Conversely, if the likelihood leans towards an extreme of the possible range for the parameter, we can have a likelihood that has a significantly different emphasis to the prior, yet we get a posterior distribution that is narrower than the prior distribution:

**Example 7**: The likelihood is highly peaked relative to the prior and focused on one extreme of the prior's range, so it disagrees considerably with the prior; yet the posterior is strongly focused despite the disagreement, because the parameter cannot be negative and is therefore constrained at zero.
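A grid sketch in the spirit of Example 7 (all distributions and numbers are invented for illustration): a broad prior over a non-negative parameter, multiplied by a likelihood pressing hard against the zero boundary, produces a posterior much narrower than the prior:

```python
import math

# Hypothetical grid over a parameter constrained to be non-negative
grid = [i / 100 for i in range(0, 1001)]     # theta in [0, 10]

# Broad prior centred at 4 (a truncated Normal shape, up to a constant)
prior = [math.exp(-0.5 * ((t - 4) / 3) ** 2) for t in grid]
# Likelihood strongly favouring values near the zero boundary
likelihood = [math.exp(-3 * t) for t in grid]

unnorm = [p * l for p, l in zip(prior, likelihood)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

def sd(weights):
    """Standard deviation of the distribution given by weights on the grid."""
    w_total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / w_total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / w_total
    return math.sqrt(var)

# Despite the strong disagreement between prior and likelihood, the posterior
# is far narrower than the prior: the boundary at zero blocks any spread left.
```

The posterior piles up against zero in a spike much tighter than the prior's spread, which is the mechanism the caption above describes.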