Newton's method vs gradient descent

In summary, the conversation discusses finding the minimum of a 2D surface. The original poster first coded a gradient descent algorithm, which worked but required careful selection of a step size and converged slowly. They then implemented Newton's method, which converged much faster but over a smaller region, requiring an initial guess closer to the correct answer. The thread goes on to discuss a hybrid gradient/Newton approach, Levenberg-Marquardt style diagonal loading of the Hessian, the conjugate gradient method, and linearizing the problem into an iterative least squares solve. It ends with the poster weighing the benefits and drawbacks of each method and asking for feedback on their approach.
  • #1
Chuck37
I'm working on a problem where I need to find minimum of a 2D surface. I initially coded up a gradient descent algorithm, and though it works, I had to carefully select a step size (which could be problematic), plus I want it to converge quickly. So, I went through immense pain to derive the Hessian and implemented Newton's method. What I'm finding is that while Newton's method can converge impressively fast (a handful of iterations), it has a much smaller range of convergence. The initial guess must be closer to the correct answer.

Is this a known thing, or might I be doing something wrong? I can make sense of what's happening by looking at the slope of my error surface (the thing I'm trying to minimize). What happens is that at a certain distance from the minimum, the rate of change of the slope (the second derivative) goes negative, so dividing by it drives me further from the correct answer.

As an example of what I mean, consider a 1D case where I'm trying to find the minimum of f=cos(x) between 0 and 2pi (looks like a bowl, kind of). If I just look at the slope I can see that a gradient based approach would converge anywhere within [0,2*pi] - the slope is negative below pi (the answer) and positive above.

If I look at f'' though, I see that it goes negative below pi/2 and above 1.5*pi. Thus if I start outside of that region, f'/f'' drives me the wrong way.

I guess I'm just disappointed since I thought Newton's method was hands down better than gradient. I want to make sure I'm not missing something. Maybe this is just the price you pay for the faster convergence?

Off the top of my head, I can't think of a case where it would be beneficial to have the 1/f'' reverse the direction of movement. Would it make sense to watch for this case and revert to gradient descent when it happens? Why would you want to go uphill?
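
To make the 1D example concrete, here is a minimal sketch (in Python rather than Matlab, but the arithmetic is the same) showing that a gradient step from x = 0.5 moves toward the minimum at pi, while the Newton step moves away from it because f''(0.5) < 0:

```python
# f(x) = cos(x) on [0, 2*pi]; the minimum is at x = pi.
import numpy as np

def fp(x):    # f'(x) = -sin(x)
    return -np.sin(x)

def fpp(x):   # f''(x) = -cos(x), negative for x < pi/2 or x > 3*pi/2
    return -np.cos(x)

def gradient_step(x, alpha=0.5):
    return x - alpha * fp(x)

def newton_step(x):
    return x - fp(x) / fpp(x)

x0 = 0.5                      # starting point where f''(x0) < 0
print(gradient_step(x0))      # ~0.74, moves toward the minimum at pi
print(newton_step(x0))        # ~-0.05, moves away from pi, toward the maximum at x = 0
```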
 
  • #2
Newton's Method is famous for not working well if the starting point is 'far' from the solution. A practical thing you can do is use the Levenberg-Marquardt method (i.e. add a small fixed value to the diagonal of the Hessian matrix). The larger the value, the more it behaves like the gradient descent method. After a few iterations you should be able to reduce the added value (down to zero) and get the great convergence expected with Newton's Method.
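
A rough sketch of that diagonal loading, with grad(x) and hess(x) as generic placeholders for the problem's own derivatives (nothing specific to this thread):

```python
import numpy as np

def lm_step(x, grad, hess, lam):
    # Load the Hessian diagonal with lam:
    #   lam large -> behaves like a (scaled) gradient descent step
    #   lam -> 0  -> recovers the plain Newton step
    H = hess(x) + lam * np.eye(len(x))
    return x - np.linalg.solve(H, grad(x))
```

A common schedule is to shrink lam (say by a factor of 10) after each iteration that reduces the objective, and grow it again after a failed step.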
 
  • #3
Thanks for the reply. In my case loading the diagonal really won't cut it since the diagonal terms can get quite negative at times. I found an article on Newton's method that stated flatly that the Hessian should be positive definite, so I'm putting that check into my code. That avoids outright divergence, but barely positive definite Hessians still allow highly suboptimal updates to the estimate.

I'm working a hybrid gradient/Newton approach, just trying to determine when to switch to Newton. Switching early takes longer than just letting gradient descent coast all the way in, but if I wait until just the right time to switch - and use all or most of the Newton step - I can see significant improvement.
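
One way to code that positive-definiteness check is to attempt a Cholesky factorization and fall back to a gradient step when it fails; a sketch, again with placeholder grad and hess callables:

```python
import numpy as np

def hybrid_step(x, grad, hess, alpha=0.1):
    g = grad(x)
    try:
        L = np.linalg.cholesky(hess(x))    # raises LinAlgError if H is not positive definite
        d = np.linalg.solve(L.T, np.linalg.solve(L, g))   # solve H d = g via the Cholesky factor
        return x - d                       # full Newton step
    except np.linalg.LinAlgError:
        return x - alpha * g               # fall back to a gradient descent step
```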
 
  • #4
Some years back I had good success with the conjugate gradient method for a system of nonlinear differential equations for gas pipe networks.

The programs then were all in Fortran.
 
  • #5
Chuck37 said:
I'm working a hybrid gradient/Newton approach, just trying to determine when to switch to Newton. Switching early takes longer than just letting gradient descent coast all the way in, but if I wait until just the right time to switch - and use all or most of the Newton step - I can see significant improvement.

Here's a thought: use gradient descent until the Hessian is barely positive definite, then load the diagonal for a few iterations, then switch to pure Newton. Loading the diagonal gives a solution method that sits in between gradient descent and Newton's method. The trick is knowing how much to add to the diagonal.
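
One illustrative answer to "how much to add" is to shift the diagonal by a bit more than the most negative eigenvalue. This is only a sketch; an eigendecomposition is overkill for large problems, though it is cheap for the 2-variable case in this thread:

```python
import numpy as np

def loaded_hessian(H, margin=1e-3):
    # Shift the diagonal just enough that the smallest eigenvalue is at least `margin`.
    lam_min = np.min(np.linalg.eigvalsh(H))
    lam = max(0.0, margin - lam_min)       # no load needed if H is already safely positive definite
    return H + lam * np.eye(H.shape[0])
```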
 
  • #6
I think my hybrid is working pretty well. I'm reading in a textbook about an iterative least squares technique. You basically linearize around your current estimate and then do a full least squares solution to get a new update. It looks to me pretty much like a Kalman filter with all the dynamic state stuff stripped out (no new information is coming in and the solution is known to be fixed). Can anyone comment on how this compares to gradient/Newton type techniques? I'm going to code it up to check it out, but I can't seem to find much about it on the web.
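
A minimal sketch of that linearize-and-solve loop (essentially Gauss-Newton), with residual(x) and jacobian(x) as placeholders for the problem's own functions:

```python
import numpy as np

def gauss_newton(x, residual, jacobian, iters=20, tol=1e-10):
    for _ in range(iters):
        r = residual(x)                               # residual vector at the current estimate
        J = jacobian(x)                               # Jacobian of the residual at x
        dx, *_ = np.linalg.lstsq(J, -r, rcond=None)   # least-squares solve of the linearized problem
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x
```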
 
  • #7
Are you saying that the surface you are trying to minimize is really a nonlinear least squares problem?
 
  • #8
hotvette said:
Are you saying that the surface you are trying to minimize is really a nonlinear least squares problem?

Yes, I guess I am. I coded it up as an iterative non-linear least squares thing and found the requirements for accuracy of the initial estimate to be quite strict. Modifying the algorithm to use a step size allowed it to converge in some cases, but then it just looked like a bad version of gradient descent.
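
For what it's worth, one common way to add that step size is a simple backtracking search on the full update; a sketch with objective(x) and full_step(x) as placeholders:

```python
def damped_update(x, objective, full_step, max_halvings=20):
    dx = full_step(x)                    # e.g. the full iterative least squares update
    f0 = objective(x)
    t = 1.0
    for _ in range(max_halvings):
        if objective(x + t * dx) < f0:   # accept the first step length that improves things
            return x + t * dx
        t *= 0.5
    return x                             # nothing helped; keep the current estimate
```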
 
  • #9
Hmmm, let me see if I have this straight:

1. You are trying to solve a nonlinear least squares problem

2. Steepest Descent method works but convergence is slow

3. Treating the problem as a general optimization problem (i.e. using true Hessian and solving Hx = -J) converges rapidly but has severe restrictions on starting point. By the way, I've never seen any literature suggesting this method for solving least squares problems, though I remember trying it once.

4. Hybrid version (2 & 3) works well

Your last post suggests that you've also tried a 4th approach: the classic nonlinear least squares approach of iteratively solving J^T J x = -J^T (f - y). The beauty of this approach is that you can avoid the actual computation of J^T J and J^T (f - y) by using a QR factorization of J and solving R x = -Q^T (f - y) instead. (By the way, this J is different from the J in Hx = -J.)
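
A sketch of that QR route, assuming J and the residual r = f - y have already been evaluated at the current estimate:

```python
import numpy as np

def qr_step(J, r):
    Q, R = np.linalg.qr(J)               # thin QR factorization of the Jacobian
    return np.linalg.solve(R, -Q.T @ r)  # update dx, equivalent to solving J^T J dx = -J^T r
```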

So you've tried all 4 methods?
 
  • #10
Let's make sure my terminology is right regarding nonlinear least squares.

I have two non-linear equations: f1(u,v)=a, f2(u,v)=b and I want to solve for (u,v). I'm setting it up as a weighted MMSE problem; minimize w1*(f1-a)^2 + w2*(f2-b)^2 over (u,v).

Note that I can't generally rely on having an initial estimate close to the right answer.

I have tried:

1) Gradient descent. Converges over a wide range, but selecting step size is kludgy. I get my direction and search for the furthest I can go while still seeing improvement.

2) Newton's method with full computed Hessian. Converges rapidly and elegantly when it converges. Has a much smaller region of convergence than gradients.

3) Hybrid. Use gradient descent until Hessian is positive definite then switch over to Newton. Works.

4) Iterative nonlinear least squares (as you describe). Very narrow region of convergence. Practical implementation would require setting down a grid of initial estimates and seeing which ones converged (to the same place or to someplace reasonable).

Incidentally, I have seen people smarter than me using Newton's method for just this type of thing, so I know I'm not totally in left field!
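
To make the setup at the top of this post concrete, here is a minimal sketch of the weighted objective and its gradient, with hypothetical f1 and f2 standing in for the real (undisclosed) equations:

```python
import numpy as np

w1, w2 = 1.0, 2.0                        # example weights
a, b = 1.0, 0.5                          # example targets

def f1(u, v): return u**2 + v            # hypothetical stand-ins for the real equations
def f2(u, v): return u * v

def objective(p):
    u, v = p
    return w1 * (f1(u, v) - a)**2 + w2 * (f2(u, v) - b)**2

def gradient(p):
    u, v = p
    r1, r2 = f1(u, v) - a, f2(u, v) - b
    # chain rule: 2 * w_i * residual_i * (partial derivatives of f_i)
    du = 2*w1*r1*(2*u) + 2*w2*r2*v
    dv = 2*w1*r1*1.0   + 2*w2*r2*u
    return np.array([du, dv])
```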
 
  • #11
Chuck37 said:
I have two non-linear equations: f1(u,v)=a, f2(u,v)=b and I want to solve for (u,v)

I'm not familiar with the weighted approach. The standard way I've seen for solving simultaneous nonlinear equations (f1=0, f2=0, ..., fn=0) is to iteratively solve Jx = -F. I remember solving a set of 7 simultaneous nonlinear equations this way in a uni class.
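
For reference, a bare-bones sketch of that iteration, with F(x) and jac(x) as placeholders for the system and its Jacobian:

```python
import numpy as np

def newton_system(x, F, jac, iters=50, tol=1e-12):
    # Solve F(x) = 0 by iterating x <- x + dx where J(x) dx = -F(x).
    for _ in range(iters):
        dx = np.linalg.solve(jac(x), -F(x))
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x
```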

I'd love to play with it. Can you tell me the function details either here or PM?
 
  • #12
What computer environment (eg Matlab) are you using for this?
 
  • #13
I'm prototyping in Matlab. It would eventually end up in C.

hotvette, I'm sorry but I don't think I should provide details on the project. It's not really my project after all. It would be pretty complicated to explain in any case.
 
  • #14
Have you looked at the MathWorks website for Matlab optimization resources?

http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/brn4nq6.html
 
  • #15
Chuck37 said:
I'm sorry but I don't think I should provide details on the project

No worries. I just like challenges :smile:. Unless there is very good reason to take the weighted MMSE approach, you might consider the more straightforward method I mentioned for solving simultaneous nonlinear equations.
 
  • #16
hotvette said:
No worries. I just like challenges :smile:. Unless there is very good reason to take the weighted MMSE approach, you might consider the more straightforward method I mentioned for solving simultaneous nonlinear equations.

Solving it with weighting is not really much tougher. I just pulled the equations out of a textbook. Mostly the same except with some W's floating around. :wink: The main issue was just the requirement for a good initial guess compared to the other methods.

Thanks for the help.
 

Related to Newton's method vs gradient descent

What is Newton's method?

Newton's method is an algorithm for finding the roots of a function. Each iteration uses the function value and its derivative to jump to the root of the local linear approximation. Applied to minimization, it is run on the gradient, so it also requires second derivatives (the Hessian).

What is gradient descent?

Gradient descent is an optimization algorithm used to find the minimum value of a function. It takes steps in the direction of steepest descent, i.e. along the negative of the gradient of the function at the current point, scaled by a step size.

What is the main difference between Newton's method and gradient descent?

The main difference is the information each method uses. Gradient descent uses only first derivatives (the gradient) and needs a step size, while Newton's method also uses second derivatives (the Hessian) to scale and orient the step. Near the solution Newton's method typically converges much faster, but each iteration is more expensive and it is less robust when started far from the solution.
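
For reference, the two update rules side by side (standard textbook forms, with step size α):

```latex
\[
  \underbrace{x_{k+1} = x_k - \alpha\,\nabla f(x_k)}_{\text{gradient descent}}
  \qquad\qquad
  \underbrace{x_{k+1} = x_k - \bigl[\nabla^2 f(x_k)\bigr]^{-1}\nabla f(x_k)}_{\text{Newton's method}}
\]
```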

When should I use Newton's method over gradient descent?

Newton's method is most effective when the function is smooth, second derivatives are available, and the starting point is reasonably close to the minimum, where its fast (quadratic) convergence pays off. It may diverge or head toward the wrong stationary point if the starting guess is far from the solution or the Hessian is not positive definite.

When should I use gradient descent over Newton's method?

Gradient descent is more suitable when second derivatives are unavailable or too expensive to compute, when the problem has many variables (forming and solving with the Hessian becomes costly), or when robustness to a poor starting point matters more than speed. The trade-offs are slower convergence and the need to choose a step size.
