# Log likelihood function and EM algorithm

#### Jameson

Staff member
I'm going to start by asking about an example from class and then hopefully use that to work on the problem I need to solve. Here is an example:

Let's say we have a multinomial distribution $x \sim M(n;.5+.25\theta,0.25(1-\theta),0.25(1-\theta),0.5\theta)$.

The likelihood function of $\theta$ given the vector $x$ is:

$$\displaystyle f(x|\theta) = \frac{n!}{x_1!x_2!x_3!x_4!}(.5+.25\theta)^{x_1}(0.25(1-\theta))^{x_2+x_3}(0.5\theta)^{x_4}$$.

This part is fine. It's this next step that confuses me.

$$\displaystyle \log (f(x|\theta))=x_1 \log(2+ \theta)+(x_2+x_3) \log(1-\theta)+x_4 \log(\theta)+\text{constant}$$

I understand the standard log rules, I believe, but I don't see how $\log(.5+.25\theta)^{x_1}=x_1 \log(2+ \theta)$. Obviously $x_1$ can be brought down in front of the expression, but the stuff inside the parentheses makes no sense.

#### MarkFL

Staff member
I would look at it as:

$$\displaystyle \log(.5+.25\theta)^{x_1}=x_1\log\left(\frac{1}{2}+\frac{\theta}{4} \right)=x_1\log\left(\frac{2+\theta}{4} \right)=x_1 \log(2+ \theta)+\text{constant}$$

#### Jameson

Staff member
That's it! Nice eye. Ok, so here is the problem I need to do. How does this look?

$\displaystyle f(x|\theta) = \frac{n!}{x_1!x_2!x_3!x_4!}(0.25\theta)^{x_1}(0.25(2+\theta))^{x_2}(0.5(1-2\theta))^{x_3}(0.5\theta)^{x_4}$

$\displaystyle f(x|\theta) = \frac{n!}{x_1!x_2!x_3!x_4!}\left( \frac{\theta}{4}\right)^{x_1}\left( \frac{2+\theta}{4} \right)^{x_2}\left( \frac{1-2\theta}{2} \right)^{x_3} \left(\frac{\theta}{2} \right)^{x_4}$

$\log (f(x|\theta))=x_1 \log(\theta)+x_2 \log(2+\theta)+ x_3 \log(1-2\theta)+x_4 \log(\theta) +\text{constant}$
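As a quick numerical sanity check (my own, not from the class notes), the simplified log likelihood should differ from the exact log density only by a constant that does not depend on $\theta$:

```python
import math

x = [6, 52, 28, 14]  # example counts (the data used later in this thread)

def log_f(theta):
    # exact log density, dropping the multinomial coefficient (constant in theta)
    p = [0.25 * theta, 0.25 * (2 + theta), 0.5 * (1 - 2 * theta), 0.5 * theta]
    return sum(xi * math.log(pi) for xi, pi in zip(x, p))

def simplified(theta):
    x1, x2, x3, x4 = x
    return (x1 * math.log(theta) + x2 * math.log(2 + theta)
            + x3 * math.log(1 - 2 * theta) + x4 * math.log(theta))

# the gap should be the same constant at any theta in (0, 0.5)
assert abs((log_f(0.1) - simplified(0.1)) - (log_f(0.3) - simplified(0.3))) < 1e-9
```

The assertion passing just confirms that the dropped factors ($\log 4$, $\log 2$, and the multinomial coefficient) really are absorbed into the constant.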

#### MarkFL

Staff member
> That's it! Nice eye. Ok, so here is the problem I need to do. How does this look?
>
> $\displaystyle f(x|\theta) = \frac{n!}{x_1!x_2!x_3!x_4!}(0.25\theta)^{x_1}(0.25(2+\theta))^{x_2}(0.5(1-2\theta))^{x_3}(0.5\theta)^{x_4}$
>
> $\displaystyle f(x|\theta) = \frac{n!}{x_1!x_2!x_3!x_4!}\left( \frac{\theta}{4}\right)^{x_1}\left( \frac{2+\theta}{4} \right)^{x_2}\left( \frac{1-2\theta}{2} \right)^{x_3} \left(\frac{\theta}{2} \right)^{x_4}$
>
> $\log (f(x|\theta))=x_1 \log(\theta)+x_2 \log(2+\theta)+ x_3 \log(1-2\theta)+x_4 \log(\theta) +\text{constant}$

Yes, it looks to me like you have properly applied the appropriate properties of logs.

#### Jameson

Staff member
So the next step according to my notes is to make a partition of one of the $x$ variables by writing $x_i = y_1+y_2$ for some index $i$. My professor suggested partitioning $x_2$.

So $x_2=y_1+y_2$, where $y_1$ corresponds to the probability $0.5$ and $y_2$ to $0.25\theta$, i.e. $(y_1,y_2) \sim \text{Bin}(x_2;\ 0.5/(0.5+0.25\theta),\ 0.25\theta/(0.5+0.25\theta))$

Set the complete data to be $y=(x_1,y_1,y_2,x_3,x_4)$

The complete log likelihood function should now be $(x_1+y_2) \log(\theta)+ x_3 \log(1-2\theta)+x_4 \log(\theta) +\text{constant}$
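For what it's worth, the complete log likelihood seems to follow by writing out each cell's contribution (up to the combinatorial factor) and absorbing the $\theta$-free pieces into the constant; this is my reconstruction, not from the notes:

$$\log f(y|\theta) = x_1\log\frac{\theta}{4} + y_1\log\frac{1}{2} + y_2\log\frac{\theta}{4} + x_3\log\frac{1-2\theta}{2} + x_4\log\frac{\theta}{2} = (x_1+y_2)\log(\theta)+x_3\log(1-2\theta)+x_4\log(\theta)+\text{constant}$$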

Unfortunately I'm not even at the $E$ step of the EM algorithm yet, but I believe this is how you must do it. Am I on the right track? I realize this might be a topic not many are familiar with, but I've read that this algorithm is widely used.

#### MarkFL

Staff member
I'm not familiar with this topic at all, but it appears that to get the complete log likelihood function you cite, we require:

$$\displaystyle y_2\log(\theta)=x_2\log(2+\theta)+\text{constant}$$

Does this look right?

#### Jameson

Staff member
Good thinking, but I'm not sure that's the way to go. Luckily the next step was already done in the notes for the cell with probability $0.5+0.25\theta$. Don't ask me what any of the next stuff means, though. I am going from my notes and would love it if someone could make sense of it. I do believe this is correct, though, and I'm making progress.

We first evaluate $\displaystyle E[y_2|\theta_{k},x]=\frac{\theta_{k}}{2+\theta_{k}}x_2$ and we can finally find:

$\displaystyle Q(\theta|\theta_{k},x)=\left(\frac{\theta_{k}}{ 2+\theta_{k}}x_2+x_1 \right)\log(\theta)+ x_3 \log(1-2\theta)+x_4 \log(\theta)$
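The expectation in the E step looks like it drops straight out of the binomial split above (again my reconstruction, not from the notes):

$$E[y_2|\theta_{k},x]=x_2\cdot\frac{0.25\theta_{k}}{0.5+0.25\theta_{k}}=\frac{\theta_{k}}{2+\theta_{k}}\,x_2$$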

#### MarkFL

Staff member
Yeah, I have to admit I am lost with those next steps as well, but hopefully one of our advanced statistics folks can help you make sense of it.

#### Jameson

Staff member
Well here is the best it's going to get for me for now.

Now for the M-step, we find $$\displaystyle \frac{\partial Q}{\partial \theta}=0$$.

$\displaystyle \theta_{k+1}=\frac{(\theta_k+2)x_1+\theta_k x_2 +(\theta_k+2)x_4}{2((\theta_k+2)x_1+\theta_k x_2+(\theta_k+2)(x_3+x_4))}$.

When applied for $x = [6,52,28,14]$ and $n=100$ I get the following plot. I'll try to make sense of it tomorrow. Any insight is appreciated!
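As a quick numerical check of the update above (my own code, not from the notes), iterating it with $x = [6,52,28,14]$ converges to a fixed point:

```python
import math

x1, x2, x3, x4 = 6, 52, 28, 14

theta = 0.25  # any starting value in (0, 0.5) works
for _ in range(50):
    num = (theta + 2) * x1 + theta * x2 + (theta + 2) * x4
    den = 2 * ((theta + 2) * x1 + theta * x2 + (theta + 2) * (x3 + x4))
    theta = num / den

# the fixed point solves 5*theta**2 + 3*theta - 1 = 0
assert abs(theta - (math.sqrt(29) - 3) / 10) < 1e-10
print(round(theta, 4))  # 0.2385
```

So the iteration settles at $\theta \approx 0.2385$, matching the plot's limit.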

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> I'm going to start by asking about an example from class and then hopefully use that to work on the problem I need to solve. Here is an example:
>
> Let's say we have a multinomial distribution $x \sim M(n;.5+.25\theta,0.25(1-\theta),0.25(1-\theta),0.5\theta)$.

Hmm, in a multinomial distribution the sum of the probabilities should be 1.
However, in your case it isn't.
Is it possible the last probability should be $0.25\theta$ instead of $0.5\theta$?

> I would look at it as:
>
> $$\displaystyle \log(.5+.25\theta)^{x_1}=x_1\log\left(\frac{1}{2}+\frac{\theta}{4} \right)=x_1\log\left(\frac{2+\theta}{4} \right)=x_1 \log(2+ \theta)+\text{constant}$$

I like to think of it as:
$$x_1\log\left(\frac{2+\theta}{4} \right)=x_1 \log(2+ \theta)-x_1 \log 4$$

> That's it! Nice eye. Ok, so here is the problem I need to do. How does this look?
>
> $\displaystyle f(x|\theta) = \frac{n!}{x_1!x_2!x_3!x_4!}(0.25\theta)^{x_1}(0.25(2+\theta))^{x_2}(0.5(1-2\theta))^{x_3}(0.5\theta)^{x_4}$
>
> $\displaystyle f(x|\theta) = \frac{n!}{x_1!x_2!x_3!x_4!}\left( \frac{\theta}{4}\right)^{x_1}\left( \frac{2+\theta}{4} \right)^{x_2}\left( \frac{1-2\theta}{2} \right)^{x_3} \left(\frac{\theta}{2} \right)^{x_4}$
>
> $\log (f(x|\theta))=x_1 \log(\theta)+x_2 \log(2+\theta)+ x_3 \log(1-2\theta)+x_4 \log(\theta) +\text{constant}$

$$\mathcal{L}(\theta|x) = f(x|\theta)$$
It is rather important that it's a function of $\theta$ and in particular not a function of $x$.
Therefore all terms containing only $x$ and not $\theta$ are effectively constants.
That is why all the $\log x_1!$ stuff and such disappears into a constant.

Hope this helps.

#### Jameson

Staff member
Hi I like Serena!

Yep, you are correct. Here is a screenshot of the slide with this example problem. Good point about the likelihood being a function of $\theta$ and not of $x$. It makes more sense now why we group all the terms not containing $\theta$ into one constant term.

The EM algorithm process has two main steps, E and M. The E step is still really tricky for me. It seems I need to first rewrite one of the variables in terms of two variables $y_1$ and $y_2$ and then find:

$$\displaystyle Q(\theta|\theta_k,x)=E\left[\log f(y|\theta)\,\middle|\,\theta_k,x\right]$$

In the example provided to us this step is very computationally intensive. Is it always so?

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> That's it! Nice eye. Ok, so here is the problem I need to do. How does this look?
>
> $\displaystyle f(x|\theta) = \frac{n!}{x_1!x_2!x_3!x_4!}(0.25\theta)^{x_1}(0.25(2+\theta))^{x_2}(0.5(1-2\theta))^{x_3}(0.5\theta)^{x_4}$

It seems you did not copy the expression properly.
The probabilities are not correct.
If you correct them you should get the proper result.

> Well here is the best it's going to get for me for now.
>
> Now for the M-step, we find $$\displaystyle \frac{\partial Q}{\partial \theta}=0$$.
>
> $\displaystyle \theta_{k+1}=\frac{(\theta_k+2)x_1+\theta_k x_2 +(\theta_k+2)x_4}{2((\theta_k+2)x_1+\theta_k x_2+(\theta_k+2)(x_3+x_4))}$.
>
> When applied for $x = [6,52,28,14]$ and $n=100$ I get the following plot.
>
> [plot attachment]
>
> I'll try to make sense of it tomorrow. Any insight is appreciated!

It appears they have turned it into an algorithm that approaches $\theta$ such that it has the maximum likelihood.
The result is $\theta \approx 0.24$.
The second graph might be of $\log \theta$ I guess.

It looks a bit convoluted to me though.
Let me give a simpler approach.

If we take the expression for the log-likelihood function:
$$\log \mathcal{L}(\theta|x) = x_1\log(2+\theta) + (x_2 + x_3) \log(1-\theta) + x_4 \log \theta + C$$
Since we want to find $\theta$ such that the likelihood function takes its maximal value, we take the derivative and set it equal to 0:
$$\frac d{d\theta} \log \mathcal{L}(\theta|x) = \frac{x_1}{2+\theta} - \frac{x_2 + x_3}{1-\theta} + \frac {x_4}\theta = 0$$
Solving this with the numbers given, yields the maximum likelihood estimation (MLE) of $\theta$:
$$\theta_{MLE} = 0.15$$
Hmm, that seems to be different from the given result.
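As a check of my own (bisection on the score equation above, with $x = [6,52,28,14]$):

```python
def dlogL(theta):
    # derivative of the example model's log likelihood, as given above
    x1, x2, x3, x4 = 6, 52, 28, 14
    return x1 / (2 + theta) - (x2 + x3) / (1 - theta) + x4 / theta

# dlogL is positive near 0 (the x4/theta term dominates) and negative near 1
lo, hi = 1e-6, 1 - 1e-6
for _ in range(60):
    mid = (lo + hi) / 2
    if dlogL(mid) > 0:
        lo = mid
    else:
        hi = mid

theta_mle = (lo + hi) / 2
assert abs(theta_mle - 0.15) < 0.005  # about 0.153, i.e. 0.15 to two decimals
```

So the direct maximization really does land near $0.15$ for this data under the example model.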

#### Jameson

Staff member
I believe you are looking at the example problem. The problem I was given to solve was $x \sim M(n;0.25\theta,0.25(2+\theta),0.5(1-2\theta),0.5\theta)$. Does my log likelihood function look good now? (By the way, I realize now I never explicitly gave the distribution for the problem I need to solve, so this mix-up is definitely my fault!)

The method you used is something I've seen but we were told to always partition one of the variables into two others, although I'm not really sure why. It seems like once you do this it is normally referenced as "complete data".

Anyway, when using the probabilities from the example instead of the ones in this post I get the following graph which seems to have the same result as the one you got. You are correct that the second graph is a log plot. If your method works the same then I don't understand why I have to go through this tedious mess of partitioning one variable to get "complete data". On page three of this pdf they use both methods it seems.

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> I believe you are looking at the example problem. The problem I was given to solve was $x \sim M(n;0.25\theta,0.25(2+\theta),0.5(1-2\theta),0.5\theta)$. Does my log likelihood function look good now?

Yes. Then it looks perfect.

> The method you used is something I've seen but we were told to always partition one of the variables into two others, although I'm not really sure why. It seems like once you do this it is normally referenced as "complete data".

I don't understand either yet, but see below.

> Anyway, when using the probabilities from the example instead of the ones in this post I get the following graph which seems to have the same result as the one you got. You are correct that the second graph is a log plot.

Good!

> If your method works the same then I don't understand why I have to go through this tedious mess of partitioning one variable to get "complete data". On page three of this pdf they use both methods it seems.

It appears they are stepping up to a more complex problem than a standard maximum likelihood estimation.
The problem they describe is where you don't have all the data, but only a subset of it.
This is what would happen in genetics where you only observe the result of the dominant genes, but not the actual combination of the genes involved.
By splitting up the "incomplete data" into "complete data" they can obtain a more accurate estimate of the unknowns involved.

#### Jameson

Staff member
I see. So although we have complete data for this problem we can construct an algorithm that will work even when one value is missing?

Here is the slide concerning the "E step" of the EM algorithm in the example problem. I really want to know if this process can be simplified. I will continue reading and try to find out, but on the quiz we had today I couldn't do this step.

#### Klaas van Aarsen

##### MHB Seeker
Staff member
> I see. So although we have complete data for this problem we can construct an algorithm that will work even when one value is missing?

Actually, I think it's the other way around: the model as given is the one with complete data. But then they say: suppose one of those parameters is really a combination of 2 invisible parameters; what will happen then?
With that more advanced model you don't have complete data anymore.

As long as we had complete data, we could do a "standard" maximum likelihood estimation (an "M" step).
With incomplete data, apparently we do "E" steps (expectation of the unknown data), and then we do the standard maximum likelihood estimation (the "M" step). After that we repeat the process.
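As a minimal sketch (my own, assuming the E- and M-step formulas worked out earlier in this thread), the loop just described might look like:

```python
# observed (incomplete) data from this thread's problem
x1, x2, x3, x4 = 6, 52, 28, 14

def e_step(theta):
    # E: expected value of the unobserved y2 given the current theta
    return x2 * theta / (2 + theta)

def m_step(y2):
    # M: closed-form maximizer of the complete-data log likelihood
    # (x1 + y2 + x4) * log(theta) + x3 * log(1 - 2*theta)
    return (x1 + y2 + x4) / (2 * (x1 + y2 + x3 + x4))

theta = 0.1
for _ in range(100):
    theta_new = m_step(e_step(theta))
    if abs(theta_new - theta) < 1e-12:
        break
    theta = theta_new
# converges to the same value as the fused update, about 0.2385
```

Separating the two steps like this makes it clearer why the algorithm alternates: each M step is just an ordinary MLE, but on data filled in by the preceding E step.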

> Here is the slide concerning the "E step" of the EM algorithm in the example problem. I really want to know if this process can be simplified. I will continue reading and try to find out but on the quiz we had today I couldn't do this step.

Which sub-step in this "E" step is causing you problems?
Or is it that an "E" step is used at all?

#### Jameson

The $E$ step is just tedious and not something I feel will become very quick to calculate even if I did many of these problems, but I'm sure I would get better. After talking to some of my classmates about this they seem to feel the same, so it's not just me. I'm satisfied with my understanding of this method for now as we're already moving on to a different topic.