type: Post
Created date: Jun 16, 2022 01:41 PM
category: Data Science
tags: Machine Learning
status: Published
TOC: by what why how

What is the maximum likelihood estimate (MLE)?

From (MIT) :
  • There are many methods for estimating unknown parameters from data, one of which is MLE.
  • It answers: for which parameter value does the observed data have the biggest probability?
  • It is a point estimate because it gives a single value for the unknown parameter (see the formula below).
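
In symbols, a standard way to write the MIT definition:

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\, P(\text{data} \mid \theta)$$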

When is it used?

  • When working with a probabilistic model with unknown parameters, the parameter values that give the observed data the highest probability are the most likely ones. (didl)

The basic intuition behind using maximum likelihood to fit a logistic regression model:

  • We try to find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that plugging these estimates into the model for $p(X)$ yields predicted probabilities that correspond as closely as possible to the observed outcomes. (statl)
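
In ISLR's notation, the function being maximized is the likelihood

$$\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} \big(1 - p(x_{i'})\big),$$

and $\hat{\beta}_0, \hat{\beta}_1$ are chosen to maximize it.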

What is its function name?

  • It is called the maximum likelihood estimator. (ETC2420 Lecture week 4)

Why MLE?

Q1. Why MLE?

  • Simple to compute. (MIT)
  • Intuitively appealing —
    • we try to find the value of the parameter that would have most likely produced the data we in fact observed. (Libretexts)
    • find the maximum of that function, i.e. the parameters which are most likely to have produced the observed data. (Fleshman)
  • Better than the (non-linear) least squares method because it has better statistical properties. (statl)
    • Least squares approach is in fact a special case of maximum likelihood. (statl)

Q2. Why is log better?

  • It is often easier to work with the log likelihood. (MIT)
  • Problem : (Fleshman)
    • Taking the product of small numbers creates smaller numbers, which computers can struggle to represent with finite precision.
  • Solution : (Fleshman)
    • To alleviate these numerical issues (and for other conveniences mentioned later), we often work with the log of the likelihood function, aptly named the log-likelihood.
      • Why does taking the log help? The log turns a product of many small numbers into a sum, which avoids underflow; and because the log is monotonically increasing, maximizing the log-likelihood gives the same parameters as maximizing the likelihood. (A quick numerical sketch follows this list.)
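
A quick numerical sketch of the underflow problem (hypothetical numbers, not taken from any of the sources):

```python
import numpy as np

# Suppose 1,000 independent observations, each contributing a
# likelihood factor of about 0.01 under some model (hypothetical).
probs = np.full(1000, 0.01)

# The direct product is 1e-2000, far below the smallest positive
# float64 (~5e-324), so it underflows to exactly 0.0.
print(np.prod(probs))         # 0.0

# The log-likelihood is a harmless sum: 1000 * ln(0.01)
print(np.sum(np.log(probs)))  # -4605.170185988091
```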

How do we do this process mathematically?

Both sources tell us to set the derivative to zero:

  • We could find the MLE by finding the values of θ where the derivative is zero, and finding the one that gives the highest probability. (didl)
  • If you’re familiar with calculus, you’ll know that you can find the maximum of a function by taking its derivative and setting it equal to 0. The derivative of a function represents the rate of change of the original function. (Fleshman)
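
In symbols: if $L(\theta)$ is the likelihood, we look for $\hat{\theta}$ satisfying

$$\frac{d}{d\theta} \log L(\theta) \,\Big|_{\theta = \hat{\theta}} = 0,$$

and then check that this stationary point is indeed the maximum (for example, against the endpoints of the parameter range).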

There are three examples, two of which are about flipping coins, with the same intuition and logic but from different sources (Examples 2 and 3). The other is about finding the mean of a Gaussian given its variance (Example 1).

Example 1 : Finding the mean given variance (Fleshman)

Imagine we have some data generated from a Gaussian distribution with a variance of 4, but we don’t know the mean.
I like to think of MLE as taking the Gaussian, sliding it over all possible means, and choosing the mean which causes the model to fit the data best.
We see the maximum of the log-likelihood occur at a mean of 2. In fact, that is the true mean of the distribution which created the histogram!
If you look at the log-likelihood curve above, we see that
  • initially it’s changing in the positive direction (moving up). It reaches a peak,
  • and then it starts changing in a negative direction (moving down).
The key is that at the peak, the rate of change is 0.
So if we know the functional form of the derivative, we can set it equal to 0 and solve for the best parameters.
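
A minimal numpy sketch of this "slide the Gaussian" idea (assuming, as in the passage, a true mean of 2 and variance 4; the function and variable names are mine, not Fleshman's):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=2.0, size=1000)  # variance 4 -> std 2

def gaussian_log_likelihood(mean, data, var=4.0):
    """Log-likelihood of the data under N(mean, var)."""
    return np.sum(-0.5 * np.log(2 * np.pi * var)
                  - (data - mean) ** 2 / (2 * var))

# "Slide" the Gaussian over candidate means and keep the best one.
candidate_means = np.linspace(-5, 10, 1501)
log_liks = [gaussian_log_likelihood(m, data) for m in candidate_means]
mle_mean = candidate_means[np.argmax(log_liks)]

print(mle_mean)     # close to 2
print(data.mean())  # the analytic MLE for a Gaussian mean
```

Setting the derivative to zero analytically gives the sample mean, which is why the grid search lands on (essentially) `data.mean()`.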

Example 2 : 55 heads out of 100 flips, find the MLE for the probability p of heads on a single toss. (MIT)

A coin is flipped 100 times. Given that there were 55 heads, find the maximum likelihood estimate for the probability p of heads on a single toss.
Definition of what we are finding:
[image: definition of the maximum likelihood estimate]

Step 1 : understand the notation.

[image: notation for the coin-flip likelihood]
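
The images presumably showed the setup from the MIT reading. Reconstructed: the number of heads in 100 independent flips is Binomial(100, p), so the likelihood of the observed data is

$$P(55 \text{ heads} \mid p) = \binom{100}{55}\, p^{55} (1-p)^{45}$$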

Step 2 : After knowing the definition, compute $\hat{p}$, which is the notation for the MLE.

[image: computing $\hat{p}$ by maximizing the likelihood]
2.1. Using calculus to solve the left part.
2.2. At the end:
RMB : We could find the MLE by finding the values of θ where the derivative is zero, and finding the one that gives the highest probability.
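
A worked version of the calculus step (standard algebra, consistent with the MIT reading):

$$\frac{d}{dp}\left[\ln\binom{100}{55} + 55\ln p + 45\ln(1-p)\right] = \frac{55}{p} - \frac{45}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{55}{100} = 0.55$$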


Example 3 : 9 heads out of 13 flips, find the MLE for the probability p of heads on a single toss. (didl)

Question in layman's terms:
I flipped 13 coins, and 9 came up heads; what is our best guess for the probability that the coin comes up heads?

Step 1 : understand the notation and scenario

Assumption

  • Suppose that we have a single parameter θ representing the probability that a coin flip is heads. Then the probability of getting a tails is 1−θ.
  • If our observed data X is a sequence with $n_H$ heads and $n_T$ tails, we can use the fact that independent probabilities multiply to see that $P(X \mid \theta) = \theta^{n_H}(1-\theta)^{n_T}$.

Scenario

[image: the scenario, 9 heads and 4 tails out of 13 flips]

Step 2 : Compute $\hat{\theta}$, which is the notation for the MLE.

RMB : We could find the MLE by finding the values of θ where the derivative is zero, and finding the one that gives the highest probability.
Type `x^9 (1-x)^4` in here.
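
Setting the derivative of the log-likelihood to zero gives the closed form:

$$\frac{d}{d\theta}\left[9\ln\theta + 4\ln(1-\theta)\right] = \frac{9}{\theta} - \frac{4}{1-\theta} = 0 \;\Rightarrow\; 9(1-\theta) = 4\theta \;\Rightarrow\; \hat{\theta} = \frac{9}{13}$$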

So the answer is 9/13, which makes sense.

[image: plot of the likelihood $\theta^9(1-\theta)^4$]
This has its maximum value somewhere near our expected 9/13 ≈ 0.7.
The code can be found in the book.
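
The book builds the plot with its own framework; here is a minimal numpy equivalent (my sketch, not the book's code):

```python
import numpy as np

# Likelihood of 9 heads and 4 tails as a function of theta.
theta = np.linspace(0.001, 0.999, 999)
likelihood = theta**9 * (1 - theta)**4

# The grid maximizer lands at ~0.692, matching the analytic 9/13.
print(theta[np.argmax(likelihood)])  # 0.692
print(9 / 13)                        # 0.6923076923076923
```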

Example : ETC3250 Brendi

[image: ETC3250 example]

Math

The concept and mathematical representation in general can be found here. (Fleshman)
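
In general, for i.i.d. data $x_1, \dots, x_n$ and a model $p(x \mid \theta)$, the MLE and its log-likelihood form are (standard notation, matching Fleshman's presentation):

$$\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)$$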

Shortening the log-likelihood (ETC3250)

[image: ETC3250 working for shortening the log-likelihood]

Reference :

  • didl : Zhang, A. (2021). Dive into Deep Learning. (Here)
  • statl : James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics) (2nd ed.). Springer. (Here)
  • MIT : Orloff, J., & Bloom, J. MLE Intro. (Here)