
A sane introduction to maximum likelihood estimation (MLE) and maximum a posteriori (MAP)









Independent random variables mean that the following holds:

$$p_{\theta}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p_{\theta}(x_i)$$

which means that since \(x_1, x_2, \ldots, x_n\) don’t contain information about each other, we can write the joint probability as a product of their marginals.

Another assumption is that these random variables are identically distributed, meaning that they came from the same generating distribution, which allows us to model them all with the same distribution parametrization.

Given these two assumptions, also known as IID (independently and identically distributed), we can formulate our maximum likelihood estimation problem as:

$$\hat{\theta} = \mathrm{arg}\max_\theta \prod_{i=1}^{n} p_{\theta}(x_i)$$

In many cases the densities that we multiply can be very small, so the product above can easily underflow numerically. That is why we take the logarithm: since the log is monotonic, it doesn’t change the location of the maximum, and it turns the product into a sum:

$$\hat{\theta} = \mathop{\rm arg\,max}\limits_{\theta} \sum_{i=1}^{n} \log \, p_{\theta}(x_i)$$

This is an important point and it is usually implicitly assumed. We can also divide the sum by the constant \(n\) without changing the location of the maximum, and by the weak law of large numbers this average converges, as \(n\to\infty\), to an expectation under the true data-generating distribution \(p_{\theta^*}(x)\):

$$ \frac{1}{n} \sum_{i=1}^{n} \log \, p_{\theta}(x_i) \to \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right] $$

The weak law of large numbers can be bounded using a Chebyshev bound, and if you are interested in concentration inequalities, I’ve made an article about them here where I discuss the Chebyshev bound.

To finish our formulation, given that we usually minimize objectives, we can formulate the same maximum likelihood estimation as the minimization of the negative of the log-likelihood:

$$ \hat{\theta} = \mathrm{arg}\min_\theta -\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right] $$

This is exactly the same thing, with the negation turning the maximization problem into a minimization problem.

It is well-known that maximizing the likelihood is the same as minimizing the Kullback-Leibler divergence, also known as the KL divergence.
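As a small numerical sketch of the formulation above (the Gaussian model, sample size, and parameter grid here are illustrative assumptions, not from the article), we can recover the MLE of a Gaussian mean by minimizing the negative log-likelihood and checking that it lands on the closed-form answer, the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# IID samples from the "true" distribution p_{theta*}(x) = N(5.0, 1.0)
x = rng.normal(loc=5.0, scale=1.0, size=1000)

def neg_log_likelihood(mu, x, sigma=1.0):
    # -sum_i log N(x_i | mu, sigma^2), with sigma assumed known
    return (0.5 * np.sum((x - mu) ** 2) / sigma**2
            + len(x) * np.log(sigma * np.sqrt(2.0 * np.pi)))

# Minimize the NLL over a grid of candidate parameters
grid = np.linspace(3.0, 7.0, 4001)
nll = np.array([neg_log_likelihood(mu, x) for mu in grid])
mu_hat = grid[np.argmin(nll)]

# The minimizer of the NLL agrees with the closed-form MLE (the sample mean)
print(mu_hat, x.mean())
```

A grid search is used only to keep the sketch dependency-free; in practice you would hand the NLL (and its gradient) to a numerical optimizer.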
Expanding the KL divergence between the true distribution \(p_{\theta^*}\) and our model \(p_{\theta}\):

$$ D_{KL}\left(p_{\theta^*} \,\|\, p_{\theta}\right) = \underbrace{\mathbb{E}_{x \sim p_{\theta^*}(x)} \left[\log \, p_{\theta^*}(x) \right]}_{\text{Entropy}} - \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right] $$

In the end, we are left with two terms: the first one on the left is the entropy, and the one on the right you can recognize as the negative of the log-likelihood that we saw earlier. If we want to minimize the KL divergence over \(\theta\), we can ignore the first term, since it doesn’t depend on \(\theta\) in any way, and in the end we have exactly the same maximum likelihood formulation that we saw before:

$$ \begin{eqnarray} \require{cancel} \theta^* &=& \mathrm{arg}\min_\theta \cancel{\mathbb{E}_{x \sim p_{\theta^*}(x)} \left[\log \, p_{\theta^*}(x) \right]} - \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]\\ &=& \mathrm{arg}\min_\theta -\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right] \end{eqnarray} $$

A very common scenario in Machine Learning is supervised learning, where we have data points \(x_n\) and their labels \(y_n\) building up our dataset \( D = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \} \), where we’re interested in estimating the conditional probability of \(\textbf{y}\) given \(\textbf{x}\), or more precisely \( P_{\theta}(Y | X) \).

To extend the maximum likelihood principle to the conditional case, we just have to write it as:

$$ \hat{\theta} = \mathrm{arg}\min_\theta -\mathbb{E}_{(x, y) \sim p_{\theta^*}(x, y)}\left[\log \, p_{\theta}(y | x) \right] $$

It can then be easily specialized to formulate linear regression:

$$ p_{\theta}(y \,|\, x) = \mathcal{N}(y \,;\, x^T \theta, \sigma^2) \\ \log \, p_{\theta}(y \,|\, x) = -n \log \sigma - \frac{n}{2} \log{2\pi} - \sum_{i=1}^{n}{\frac{\| x_i^T \theta - y_i \|^2}{2\sigma^2}} $$

In that case, you can see that we end up with a sum of squared errors whose optimum is at the same location as the optimum of the mean squared error (MSE).
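To see this equivalence concretely (the design matrix, noise level, and dimensions below are illustrative assumptions), here is a quick sketch: under Gaussian noise, maximizing the conditional log-likelihood is the same as minimizing the sum of squared errors, whose minimizer is the ordinary least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
theta_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(n, d))
# y | x ~ N(x^T theta, sigma^2): linear model with Gaussian noise
y = X @ theta_true + rng.normal(scale=0.3, size=n)

# Maximizing the Gaussian conditional log-likelihood in theta reduces to
# minimizing sum_i (x_i^T theta - y_i)^2, i.e. ordinary least squares:
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# At the optimum the gradient of the squared error, X^T (X theta - y),
# vanishes (the normal equations hold)
grad = X.T @ (X @ theta_hat - y)
print(theta_hat, grad)
```

Note that \(\sigma\) only scales the objective, which is why the optimum of the log-likelihood coincides with the optimum of the MSE regardless of the noise level.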
We’ll also see that the MAP has a strong connection with the regularized MLE estimation. We know from the Bayes rule that we can get the posterior from the product of the likelihood and the prior, normalized by the evidence:

$$ \begin{align} p(\theta \vert x) &= \frac{p_{\theta}(x) p(\theta)}{p(x)} \\ \label{eq:proport} &\propto p_{\theta}(x) p(\theta) \end{align} $$

In the equation \(\ref{eq:proport}\), since we’re concerned with optimization, we cancel the normalizing evidence \(p(x)\) and stay with a proportional posterior, which is very convenient because the marginalization over \(p(x)\) involves integration and is intractable for many cases.

$$ \begin{align} \theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} p_{\theta}(x) \, p(\theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \, p(\theta) \prod_{i=1}^{n} p_{\theta}(x_i) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \underbrace{\log \, p(\theta)}_{\text{Log prior}} + \sum_{i=1}^{n} \underbrace{\log \, p_{\theta}(x_i)}_{\text{Log likelihood}} \end{align} $$

In the formulation above, we just followed the same steps as described earlier for the maximum likelihood estimator: we assume independence and an identical distributional setting, followed by the application of the logarithm to switch from a product to a summation.
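The connection with regularized MLE can be sketched numerically (the Gaussian likelihood and prior, and their scales, are illustrative assumptions): a Gaussian prior on \(\theta\) adds an L2 penalty to the negative log-likelihood, and the MAP estimate shrinks the MLE toward the prior mean:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, tau = 1.0, 0.5          # likelihood std and prior std (illustrative)
x = rng.normal(loc=2.0, scale=sigma, size=50)
n = len(x)

def neg_log_posterior(theta):
    # -[ sum_i log N(x_i | theta, sigma^2) + log N(theta | 0, tau^2) ]
    # (additive constants dropped, since they don't move the argmax)
    nll = 0.5 * np.sum((x - theta) ** 2) / sigma**2
    neg_log_prior = 0.5 * theta**2 / tau**2   # the prior acts as an L2 penalty
    return nll + neg_log_prior

grid = np.linspace(-1.0, 4.0, 5001)
theta_map = grid[np.argmin([neg_log_posterior(t) for t in grid])]

# Closed form: the posterior mode shrinks the MLE (the sample mean) toward
# the prior mean (0), weighted by the precisions n/sigma^2 and 1/tau^2
closed_form = (n / sigma**2) * x.mean() / (n / sigma**2 + 1 / tau**2)
print(theta_map, closed_form, x.mean())
```

With a flat (uninformative) prior the penalty term vanishes and the MAP estimate collapses back to the MLE, which is the regularization connection in one line.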
