Maximum a posteriori estimation

Last updated

In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown quantity, that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution (that quantifies the additional information available through prior knowledge of a related event) over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of maximum likelihood estimation.

Contents

Description

Assume that we want to estimate an unobserved population parameter on the basis of observations . Let be the sampling distribution of , so that is the probability of when the underlying population parameter is . Then the function:

is known as the likelihood function and the estimate:

is the maximum likelihood estimate of .

Now assume that a prior distribution over exists. This allows us to treat as a random variable as in Bayesian statistics. We can calculate the posterior distribution of using Bayes' theorem:

where is density function of , is the domain of .

The method of maximum a posteriori estimation then estimates as the mode of the posterior distribution of this random variable:

The denominator of the posterior distribution (so-called marginal likelihood) is always positive and does not depend on and therefore plays no role in the optimization. Observe that the MAP estimate of coincides with the ML estimate when the prior is uniform (i.e., is a constant function).

When the loss function is of the form

as goes to 0, the Bayes estimator approaches the MAP estimator, provided that the distribution of is quasi-concave. [1] But generally a MAP estimator is not a Bayes estimator unless is discrete.

Computation

MAP estimates can be computed in several ways:

  1. Analytically, when the mode(s) of the posterior distribution can be given in closed form. This is the case when conjugate priors are used.
  2. Via numerical optimization such as the conjugate gradient method or Newton's method. This usually requires first or second derivatives, which have to be evaluated analytically or numerically.
  3. Via a modification of an expectation-maximization algorithm. This does not require derivatives of the posterior density.
  4. Via a Monte Carlo method using simulated annealing

Limitations

While only mild conditions are required for MAP estimation to be a limiting case of Bayes estimation (under the 0–1 loss function), [1] it is not very representative of Bayesian methods in general. This is because MAP estimates are point estimates, whereas Bayesian methods are characterized by the use of distributions to summarize data and draw inferences: thus, Bayesian methods tend to report the posterior mean or median instead, together with credible intervals. This is both because these estimators are optimal under squared-error and linear-error loss respectively—which are more representative of typical loss functions—and for a continuous posterior distribution there is no loss function which suggests the MAP is the optimal point estimator. In addition, the posterior distribution may often not have a simple analytic form: in this case, the distribution can be simulated using Markov chain Monte Carlo techniques, while optimization to find its mode(s) may be difficult or impossible.[ citation needed ]

An example of a density of a bimodal distribution in which the highest mode is uncharacteristic of the majority of the distribution Bimodal density.svg
An example of a density of a bimodal distribution in which the highest mode is uncharacteristic of the majority of the distribution

In many types of models, such as mixture models, the posterior may be multi-modal. In such a case, the usual recommendation is that one should choose the highest mode: this is not always feasible (global optimization is a difficult problem), nor in some cases even possible (such as when identifiability issues arise). Furthermore, the highest mode may be uncharacteristic of the majority of the posterior.

Finally, unlike ML estimators, the MAP estimate is not invariant under reparameterization. Switching from one parameterization to another involves introducing a Jacobian that impacts on the location of the maximum. [2]

As an example of the difference between Bayes estimators mentioned above (mean and median estimators) and using a MAP estimate, consider the case where there is a need to classify inputs as either positive or negative (for example, loans as risky or safe). Suppose there are just three possible hypotheses about the correct method of classification , and with posteriors 0.4, 0.3 and 0.3 respectively. Suppose given a new instance, , classifies it as positive, whereas the other two classify it as negative. Using the MAP estimate for the correct classifier , is classified as positive, whereas the Bayes estimators would average over all hypotheses and classify as negative.

Example

Suppose that we are given a sequence of IID random variables and a prior distribution of is given by . We wish to find the MAP estimate of . Note that the normal distribution is its own conjugate prior, so we will be able to find a closed-form solution analytically.

The function to be maximized is then given by

which is equivalent to minimizing the following function of :

Thus, we see that the MAP estimator for μ is given by

which turns out to be a linear interpolation between the prior mean and the sample mean weighted by their respective covariances.

The case of is called a non-informative prior and leads to an improper probability distribution; in this case

Related Research Articles

In statistics, a location parameter of a probability distribution is a scalar- or vector-valued parameter , which determines the "location" or shift of the distribution. In the literature of location parameter estimation, the probability distributions with such parameter are found to be formally defined in one of the following equivalent ways:

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive is because of randomness or because the estimator does not account for information that could produce a more accurate estimate. In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk, as an estimate of the true MSE.

<span class="mw-page-title-main">Expectation–maximization algorithm</span> Iterative method for finding maximum likelihood estimates in statistical models

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step. It can be used, for example, to estimate a mixture of gaussians, or to solve the multiple linear regression problem.

<span class="mw-page-title-main">Cramér–Rao bound</span> Lower bound on variance of an estimator

In estimation theory and statistics, the Cramér–Rao bound (CRB) relates to estimation of a deterministic parameter. The result is named in honor of Harald Cramér and C. R. Rao, but has also been derived independently by Maurice Fréchet, Georges Darmois, and by Alexander Aitken and Harold Silverstone. It is also known as Fréchet-Cramér–Rao or Fréchet-Darmois-Cramér-Rao lower bound. It states that the precision of any unbiased estimator is at most the Fisher information; or (equivalently) the reciprocal of the Fisher information is a lower bound on its variance.

In mathematical statistics, the Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.

<span class="mw-page-title-main">Directional statistics</span>

Directional statistics is the subdiscipline of statistics that deals with directions, axes or rotations in Rn. More generally, directional statistics deals with observations on compact Riemannian manifolds including the Stiefel manifold.

<span class="mw-page-title-main">Consistent estimator</span> Statistical estimator converging in probability to a true parameter as sample size increases

In statistics, a consistent estimator or asymptotically consistent estimator is an estimator—a rule for computing estimates of a parameter θ0—having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to θ0. This means that the distributions of the estimates become more and more concentrated near the true value of the parameter being estimated, so that the probability of the estimator being arbitrarily close to θ0 converges to one.

<span class="mw-page-title-main">Rice distribution</span> Probability distribution

In probability theory, the Rice distribution or Rician distribution is the probability distribution of the magnitude of a circularly-symmetric bivariate normal random variable, possibly with non-zero mean (noncentral). It was named after Stephen O. Rice (1907–1986).

In Bayesian probability, the Jeffreys prior, named after Sir Harold Jeffreys, is a non-informative prior distribution for a parameter space; its density function is proportional to the square root of the determinant of the Fisher information matrix:

In statistics, the method of moments is a method of estimation of population parameters. The same principle is used to derive higher moments like skewness and kurtosis.

In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function. Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.

In statistics, the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. In statistics, "bias" is an objective property of an estimator. Bias is a distinct concept from consistency: consistent estimators converge in probability to the true value of the parameter, but may be biased or unbiased; see bias versus consistency for more.

In credibility theory, a branch of study in actuarial science, the Bühlmann model is a random effects model used to determine the appropriate premium for a group of insurance contracts. The model is named after Hans Bühlmann who first published a description in 1967.

<span class="mw-page-title-main">Wrapped normal distribution</span>

In probability theory and directional statistics, a wrapped normal distribution is a wrapped probability distribution that results from the "wrapping" of the normal distribution around the unit circle. It finds application in the theory of Brownian motion and is a solution to the heat equation for periodic boundary conditions. It is closely approximated by the von Mises distribution, which, due to its mathematical simplicity and tractability, is the most commonly used distribution in directional statistics.

<span class="mw-page-title-main">Maximum spacing estimation</span> Method of estimating a statistical models parameters

In statistics, maximum spacing estimation (MSE or MSP), or maximum product of spacing estimation (MPS), is a method for estimating the parameters of a univariate statistical model. The method requires maximization of the geometric mean of spacings in the data, which are the differences between the values of the cumulative distribution function at neighbouring data points.

In statistics, an adaptive estimator is an estimator in a parametric or semiparametric model with nuisance parameters such that the presence of these nuisance parameters does not affect efficiency of estimation.

In statistics, efficiency is a measure of quality of an estimator, of an experimental design, or of a hypothesis testing procedure. Essentially, a more efficient estimator needs fewer input data or observations than a less efficient one to achieve the Cramér–Rao bound. An efficient estimator is characterized by having the smallest possible variance, indicating that there is a small deviance between the estimated value and the "true" value in the L2 norm sense.

<span class="mw-page-title-main">Hermite distribution</span> Statistical probability Distribution for discrete event counts

In probability theory and statistics, the Hermite distribution, named after Charles Hermite, is a discrete probability distribution used to model count data with more than one parameter. This distribution is flexible in terms of its ability to allow a moderate over-dispersion in the data.

In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.

References

  1. 1 2 Bassett, Robert; Deride, Julio (2018-01-30). "Maximum a posteriori estimators as a limit of Bayes estimators". Mathematical Programming: 1–16. arXiv: 1611.05917 . doi:10.1007/s10107-018-1241-0. ISSN   0025-5610.
  2. Murphy, Kevin P. (2012). Machine learning : a probabilistic perspective. Cambridge, Massachusetts: MIT Press. pp. 151–152. ISBN   978-0-262-01802-9.