Bayes' theorem is a result in probability
theory, which gives the conditional probability distribution of a random variable A given B in terms of the conditional
probability distribution of variable B given A and the marginal probability distribution of A alone. As a mathematical theorem, Bayes' theorem is valid regardless of whether one adopts a frequentist or a Bayesian interpretation
of probability. However, there is disagreement as to what kinds of variables
can be substituted for A and B in the theorem; this topic is treated at greater length in the articles on
Bayesian probability and frequentist probability.
Historical remarks
Bayes' theorem is named after the Reverend Thomas Bayes (1702–61).
Bayes worked on the problem of computing a distribution for the parameter of a binomial distribution (to use modern terminology);
his work was edited and presented posthumously (1763) by his friend Richard Price, in An Essay towards solving a Problem in
the Doctrine of Chances. Bayes' results were
replicated and extended by Laplace, who apparently was not aware of
Bayes' work, in an essay of 1774.
The main result (Proposition 9 in the essay) derived by Bayes is the following: assuming a uniform distribution for the prior
distribution of the binomial parameter p, the probability that p is between two values a and
b is
- $\dfrac{\int_a^b \binom{n+m}{m}\, p^m (1-p)^n \, dp}{\int_0^1 \binom{n+m}{m}\, p^m (1-p)^n \, dp}$
where m is the number of observed successes and n the number of observed failures. His preliminary results,
in particular Propositions 3, 4, and 5, imply the result now called Bayes' theorem (as described below), but it does not appear
that Bayes himself emphasized or focused on that result.
What is "Bayesian" about Proposition 9 is that Bayes presented it as a probability for the parameter p. That is, not
only can one compute probabilities for experimental outcomes, but also for the parameter which governs them, and the same algebra
is used to make inferences of either kind. Interestingly, Bayes actually states his question in a way that might make the idea of
assigning a probability distribution to a parameter palatable to a frequentist. He supposes that a billiard ball is thrown at
random onto a billiard table, and that the probabilities p and q are the probabilities that subsequent billiard
balls will fall above or below the first ball. By making the binomial parameter p depend on a random event, he cleverly
escapes a philosophical quagmire that he most likely was not even aware was an issue.
Statement of Bayes' theorem
Bayes' theorem is a relation among conditional and marginal probabilities. It can be viewed as a means of incorporating
information, from an observation, for example, to produce a modified or updated probability distribution.
Suppose the marginal probability density function or probability mass function of a random variable X is
- $f_X(x)$
(note the distinction between the capital X, which denotes the random variable, and the lower-case x, which denotes a value it may take). This is the
prior probability distribution of X. Suppose the conditional probability density function or
probability mass function of Y given X = x is
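- $f_Y(y \mid X = x)$
With y held fixed at its observed value and regarded as a function of x, this is the likelihood
function
- $L(x \mid y) = f_Y(y \mid X = x)$
The likelihood function is not, in general, a probability density function or a probability mass function of x: it need not integrate or sum to 1 over x.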
Bayes' theorem says:
- To get the posterior probability distribution of X (i.e., the conditional probability distribution of
X given Y = y), multiply the prior probability density function (or mass function) by the likelihood function, and
then normalize.
"Normalize" means to multiply by a constant to make the resulting function a probability density function or a probability
mass function. Thus the posterior probability density function is
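- $f_X(x \mid Y = y) = \dfrac{f_X(x)\, f_Y(y \mid X = x)}{\int_{-\infty}^{\infty} f_X(u)\, f_Y(y \mid X = u)\, du}$
The normalizing constant in the denominator is
- $f_Y(y) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(y \mid X = x)\, dx$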
In the discrete case, one would have a sum rather than an integral. If one takes the measure-theoretic viewpoint, either is an integral.
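The "multiply, then normalize" recipe can be sketched numerically. The following is only an illustration under stated assumptions, not part of the theorem itself: the helper names `normal_pdf` and `grid_posterior`, the evenly spaced grid, and the normal prior and normal likelihood are all hypothetical choices made for the example, and the integral is approximated by the trapezoid rule.

```python
import math


def normal_pdf(z, mean, sd):
    """Density of a normal distribution, used here only as a toy model."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def grid_posterior(prior, likelihood, grid):
    """Approximate the posterior density of X on an evenly spaced grid.

    prior(x) is f_X(x); likelihood(x) is f_Y(y | X = x) with the observed
    y already fixed. Returns (posterior values, normalizing constant).
    """
    dx = grid[1] - grid[0]
    unnormalized = [prior(x) * likelihood(x) for x in grid]
    # Trapezoid-rule estimate of the integral of prior * likelihood.
    norm = dx * (sum(unnormalized) - 0.5 * (unnormalized[0] + unnormalized[-1]))
    return [u / norm for u in unnormalized], norm


# Hypothetical setup: X ~ Normal(0, 1), Y | X = x ~ Normal(x, 1), observed y = 1.
grid = [-10.0 + 0.01 * i for i in range(2001)]
y_observed = 1.0
posterior, norm = grid_posterior(
    prior=lambda x: normal_pdf(x, 0.0, 1.0),
    likelihood=lambda x: normal_pdf(y_observed, x, 1.0),
    grid=grid,
)
dx = grid[1] - grid[0]
posterior_mean = dx * sum(x * p for x, p in zip(grid, posterior))
print(norm, posterior_mean)
```

For this particular toy model the posterior is known in closed form (normal with mean y/2 and variance 1/2), so the printed mean should be close to 0.5. In the discrete case the same function shape applies with the integral replaced by a plain sum.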
Example
Suppose the proportion R of voters who will vote "yes" in a referendum is uniformly distributed between 0 and 1. That is the prior probability distribution of R. A
random sample of 10 voters is taken, and it is found that seven of them will vote "yes". The conditional distribution of the
number X of voters in this small sample who will vote "yes", given that (capital) R is some particular number
(lower-case) r, is a binomial distribution with
parameters 10 and r, i.e., it is the distribution of the number of "successes" in 10 independent Bernoulli trials with probability r of success on each trial. One therefore has
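- $P(X = x \mid R = r) = \binom{10}{x}\, r^x (1-r)^{10-x}$
Since X was observed to be 7, the likelihood function is
- $L(r) = P(X = 7 \mid R = r) = \binom{10}{7}\, r^7 (1-r)^3 = 120\, r^7 (1-r)^3$
for 0 ≤ r ≤ 1. The prior probability density function is
- $f_R(r) = 1$ for 0 ≤ r ≤ 1,
and 0 otherwise. Multiplying the prior by the likelihood, we get
- $f_R(r)\, L(r) = 120\, r^7 (1-r)^3$
if 0 ≤ r ≤ 1, and 0 otherwise. Integrating, we get
- $\int_0^1 120\, r^7 (1-r)^3 \, dr = 120 \cdot \dfrac{7!\, 3!}{11!} = \dfrac{1}{11}$
so the posterior probability density function is
- $f_R(r \mid X = 7) = 11 \cdot 120\, r^7 (1-r)^3 = 1320\, r^7 (1-r)^3$
for r between 0 and 1, and 0 otherwise.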
One may be interested in the probability that more than half the voters will vote "yes". The prior probability that
more than half the voters will vote "yes" is 1/2, by the symmetry of the uniform distribution. The posterior probability that more than half the voters will vote "yes", i.e.,
the conditional probability given the outcome of the opinion poll (that seven of the 10 voters questioned will vote "yes"),
is
- $\int_{1/2}^{1} 1320\, r^7 (1-r)^3 \, dr = \dfrac{227}{256} \approx 0.887$,
about an 89% chance.
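The arithmetic of the referendum example can be checked with exact rational arithmetic. This is a minimal sketch using only the Python standard library; the function names `beta_integral` and `prob_greater_than` are made up for the illustration, and the integrals are evaluated by expanding (1 - r)^3 with the binomial theorem rather than by any numerical quadrature.

```python
from fractions import Fraction
from math import comb, factorial

n_sampled, n_yes = 10, 7
n_no = n_sampled - n_yes


def beta_integral(a, b):
    """Exact value of the integral of r^a (1-r)^b over [0, 1], i.e. a! b! / (a+b+1)!."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


binom_coeff = comb(n_sampled, n_yes)               # coefficient of the likelihood
norm = binom_coeff * beta_integral(n_yes, n_no)    # integral of prior * likelihood
posterior_coeff = binom_coeff / norm               # posterior is posterior_coeff * r^7 (1-r)^3


def prob_greater_than(threshold):
    """Posterior probability that R exceeds threshold, computed exactly.

    Expands (1-r)^n_no by the binomial theorem and integrates each power
    of r from threshold to 1.
    """
    threshold = Fraction(threshold)
    total = Fraction(0)
    for k in range(n_no + 1):
        power = n_yes + k + 1
        coeff = Fraction(comb(n_no, k) * (-1) ** k, power)
        total += coeff * (1 - threshold ** power)
    return posterior_coeff * total


p_majority = prob_greater_than(Fraction(1, 2))
print(norm, posterior_coeff, p_majority, float(p_majority))
```

The script reproduces the normalizing constant 1/11 and the posterior coefficient 1320 from the text, and the posterior probability of a "yes" majority comes out as exactly 227/256.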
Derivation in the discrete case
To derive Bayes' theorem in the discrete case, note first from the definition of conditional probability that
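- $P(A \mid B)\, P(B) = P(A, B) = P(B \mid A)\, P(A)$
denoting by P(A,B) the joint probability of A and B. Dividing the left- and
right-hand sides by P(B), assumed to be nonzero, we obtain
- $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$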
which is Bayes' theorem.
Each term in Bayes' theorem has a conventional name. The term P(A) is called the prior probability of A. It is "prior" in the sense that it
precedes any information about B. Equivalently, P(A) is also called the marginal probability
of A. The term P(A|B) is called the posterior probability of A, given B. It is "posterior" in the sense that it is
derived from or entailed by the specified value of B. The term P(B|A), for a specific value
of B, is called the likelihood function for
A. The term P(B) is called the prior or marginal probability of B.
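The derivation and the named terms can be checked against a small joint probability table. The numbers below are hypothetical and chosen only for illustration; the sketch computes the posterior P(A|B) once directly from the definition of conditional probability and once through Bayes' theorem, and the two should agree.

```python
from fractions import Fraction

# Hypothetical joint probabilities P(A, B), keyed by (A occurs, B occurs).
joint = {
    (True, True): Fraction(3, 20),
    (True, False): Fraction(1, 20),
    (False, True): Fraction(1, 5),
    (False, False): Fraction(3, 5),
}
assert sum(joint.values()) == 1

# Marginal (prior) probabilities, obtained by summing out the other event.
p_a = sum(p for (a, _), p in joint.items() if a)
p_b = sum(p for (_, b), p in joint.items() if b)

# Likelihood P(B | A), from the definition of conditional probability.
p_b_given_a = joint[(True, True)] / p_a

# Posterior P(A | B), computed two ways.
posterior_direct = joint[(True, True)] / p_b   # definition: P(A, B) / P(B)
posterior_bayes = p_b_given_a * p_a / p_b      # Bayes' theorem

print(p_a, p_b, p_b_given_a, posterior_direct, posterior_bayes)
```

With these made-up numbers the prior P(A) is 1/5, the likelihood P(B|A) is 3/4, and both routes give the posterior 3/7.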
Alternative forms of Bayes' theorem
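Bayes' theorem is often expanded by noting that
- $P(B) = P(A, B) + P(A^C, B) = P(B \mid A)\, P(A) + P(B \mid A^C)\, P(A^C)$
so the theorem can be restated as
- $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid A^C)\, P(A^C)}$
where $A^C$ is the complementary event of A. More generally, where $\{A_i\}$ forms a
partition of the event space,
- $P(A_i \mid B) = \dfrac{P(B \mid A_i)\, P(A_i)}{\sum_j P(B \mid A_j)\, P(A_j)}$
for any $A_i$ in the partition.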
See also the law of total probability.
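The partition form translates almost directly into code. The following is a sketch only: the function name `posterior_over_partition` and the three-event example with its probabilities are hypothetical, and the inputs are assumed to be given as parallel lists of P(A_i) and P(B|A_i).

```python
from fractions import Fraction


def posterior_over_partition(priors, likelihoods):
    """Return the list of P(A_i | B) for a partition {A_i}.

    priors[i] is P(A_i) and likelihoods[i] is P(B | A_i).
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    if abs(sum(priors) - 1) > 1e-12:
        raise ValueError("priors over a partition must sum to 1")
    joint = [p * l for p, l in zip(priors, likelihoods)]
    p_b = sum(joint)  # law of total probability: P(B) = sum_j P(B | A_j) P(A_j)
    if p_b == 0:
        raise ValueError("P(B) is zero, so the posterior is undefined")
    return [j / p_b for j in joint]


# Hypothetical three-event partition.
priors = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
likelihoods = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]
posteriors = posterior_over_partition(priors, likelihoods)
print(posteriors)
```

The two-event form with A and its complement is the special case of a partition with two members. Using `Fraction` keeps the results exact; floats would also work, at the cost of rounding.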
Bayes' theorem for probability densities
There is also a version of Bayes' theorem for continuous distributions. It is somewhat harder to derive, since probability
densities, strictly speaking, are not probabilities, so Bayes' theorem has to be established by a limit process; see Papoulis
(citation below), Section 7.3 for an elementary derivation. Bayes' theorem for probability densities is formally similar to the
theorem for probabilities:
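- $f(x \mid y) = \dfrac{f(x, y)}{f(y)} = \dfrac{f(y \mid x)\, f(x)}{f(y)}$
and there is an analogous statement of the law of total probability:
- $f(x \mid y) = \dfrac{f(y \mid x)\, f(x)}{\int_{-\infty}^{\infty} f(y \mid x)\, f(x)\, dx}$
As in the discrete case, the terms have standard names. f(x, y) is the joint distribution of
x and y, f(x|y) is the posterior distribution, f(y|x), regarded as a function of x, is
the likelihood function, and f(x) and f(y) are marginal distributions. Here we have indulged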
in a conventional abuse of notation, using f for each one of these terms, although each one is really a different
function; the functions are distinguished by the names of their arguments.
Extensions of Bayes' theorem
Theorems analogous to Bayes' theorem hold in problems with more than two variables. These theorems are not given distinct
names, as they may be mass-produced by applying the laws of
probability. The general strategy is to work with a decomposition of the joint probability, and to marginalize (integrate) over
the variables that are not of interest. Depending on the form of the decomposition, it may be possible to prove that some
integrals must be 1, and thus they fall out of the decomposition; exploiting this property can reduce the computations very
substantially. A Bayesian network is essentially a mechanism for
automatically generating the extensions of Bayes' theorem that are appropriate for a given decomposition of the joint
probability.
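The strategy of decomposing the joint probability and marginalizing can be sketched for a small discrete case. Everything specific here is an assumption made for illustration: four binary variables with the chain decomposition P(a, b, c, d) = P(a) P(b|a) P(c|b) P(d|c), made-up probability tables, and the goal of computing P(a | c). Because d is not observed, its factor sums to 1 and drops out, which the sketch checks against a brute-force sum over the full joint.

```python
from fractions import Fraction as F

# Hypothetical tables: p_b_given_a[a][b] is P(b | a), and so on.
p_a = {0: F(7, 10), 1: F(3, 10)}
p_b_given_a = {0: {0: F(9, 10), 1: F(1, 10)}, 1: {0: F(1, 5), 1: F(4, 5)}}
p_c_given_b = {0: {0: F(3, 4), 1: F(1, 4)}, 1: {0: F(1, 4), 1: F(3, 4)}}
p_d_given_c = {0: {0: F(1, 2), 1: F(1, 2)}, 1: {0: F(1, 5), 1: F(4, 5)}}
VALUES = (0, 1)


def normalize(weights):
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


def posterior_a_given_c(c):
    """P(a | c), marginalizing over b only; the sum over d equals 1 and is omitted."""
    weights = {
        a: sum(p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c] for b in VALUES)
        for a in VALUES
    }
    return normalize(weights)


def posterior_a_given_c_brute_force(c):
    """Same quantity, summing the full joint over both b and d."""
    weights = {
        a: sum(
            p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c] * p_d_given_c[c][d]
            for b in VALUES
            for d in VALUES
        )
        for a in VALUES
    }
    return normalize(weights)


result = posterior_a_given_c(1)
print(result, result == posterior_a_given_c_brute_force(1))
```

Dropping the sum over d halves the number of terms here; in larger decompositions the same observation is what makes the savings substantial.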
Examples
Typical examples that use Bayes' theorem assume Bayesian
probability. For worked-out examples, please see Examples of Bayesian inference.
References
Versions of the essay
- Thomas Bayes (1763), "An Essay towards solving a Problem in the Doctrine of Chances", Philosophical Transactions of the
Royal Society of London, 53.
- Thomas Bayes (1763/1958) "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a
Problem in the Doctrine of Chances", Biometrika 45:296-315 (Bayes's essay in modernized notation)
Commentaries
- G. A. Barnard (1958), "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a
Problem in the Doctrine of Chances", Biometrika 45:293-295 (biographical remarks)
- Stephen M. Stigler (1982) "Thomas Bayes' Bayesian Inference," Journal of the Royal Statistical Society, Series A,
145:250-258 (Stigler argues for a revised interpretation of the essay; recommended)
- Isaac Todhunter (1865) A History of the Mathematical Theory
of Probability from the time of Pascal to that of Laplace, Macmillan. Reprinted 1949, 1956 by Chelsea and 2001 by
Thoemmes.
Additional material
- Pierre-Simon Laplace (1774), "Mémoire sur la Probabilité des Causes par les Événements," Savants Étrangers 6:621-656,
also Oeuvres 8:27-65.
- Pierre-Simon Laplace (1774/1986), "Memoir on the Probability of the Causes of Events", Statistical Science,
1(3):364-378.
- Stephen M. Stigler (1986), "Laplace's 1774 memoir on inverse probability," Statistical Science, 1(3):359-378.
- Stephen M. Stigler (1983), "Who Discovered Bayes's Theorem?" The American Statistician, 37(4):290-296.
- Athanasios Papoulis (1984), Probability, Random
Variables, and Stochastic Processes, second edition. New York: McGraw-Hill.