4.4 Two Natural Weaknesses of Newton’s Method*




* The following is part of an early draft of the second edition of Machine Learning Refined. The published text (with revised material) is now available on Amazon as well as other major book retailers. Instructors may request an examination copy from Cambridge University Press.

Newton's method is a powerful algorithm that makes enormous progress toward a function's minimum at each step (unlike e.g., zero and first order methods, which can require a huge number of steps to make equal progress). However, Newton's method suffers from two fundamental problems that limit its popularity in certain fields of machine learning, which we discuss here. These problems involve its application to minimizing non-convex functions, and issues scaling with input dimension.


Application to minimizing non-convex functions

As discussed in the previous Section, Newton's method can behave very badly when applied to minimizing non-convex functions. Since each step is based on a second order approximation to the function, if the curvature at a point is concave Newton's method will naturally take a step uphill.
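As a rough illustration of this uphill behavior, the short sketch below applies a single Newton step to the hypothetical non-convex function $g(w) = \text{sin}(3w) + 0.3w^2$ (an example of our own, not one from the text) at a point of negative curvature; the step increases the function's value rather than decreasing it.

```python
import numpy as np

# hypothetical non-convex example (not from the text): g(w) = sin(3w) + 0.3 w^2
g   = lambda w: np.sin(3*w) + 0.3*w**2
dg  = lambda w: 3*np.cos(3*w) + 0.6*w          # first derivative
d2g = lambda w: -9*np.sin(3*w) + 0.6           # second derivative

# pure Newton step: w <-- w - g'(w) / g''(w)
newton_step = lambda w: w - dg(w)/d2g(w)

w = 0.5
print(d2g(w))                    # ~ -8.4: negative curvature, locally concave
print(g(w), g(newton_step(w)))   # ~ 1.07 --> ~ 1.09: the step moves uphill
```

Note that at this same point a gradient descent step would move downhill, since it follows the negative gradient regardless of curvature; the Newton step instead heads toward the nearby stationary point, here a local maximum.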

On the other hand, because the second order approximation used by Newton's method contains so much more local information than e.g., an analogous first order step, when applied to convex functions Newton's method can converge to a global minimum in far fewer steps - particularly when it is close to a minimum (since the quadratic approximation often matches a convex function very well locally).
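To make this concrete, here is a minimal sketch (again a hypothetical example of our own, not from the text) of Newton's method applied to the convex function $g(w) = \text{log}(1 + e^w) - 0.5w$, whose global minimum lies at $w = 0$; starting from $w = 1$ it lands essentially on the minimizer within a handful of steps.

```python
import numpy as np

# hypothetical convex example (not from the text): g(w) = log(1 + e^w) - 0.5 w,
# whose global minimum sits at w = 0
g   = lambda w: np.log(1 + np.exp(w)) - 0.5*w
dg  = lambda w: 1/(1 + np.exp(-w)) - 0.5        # first derivative
d2g = lambda w: np.exp(-w)/(1 + np.exp(-w))**2  # second derivative (always > 0)

w = 1.0
for k in range(5):
    w = w - dg(w)/d2g(w)         # pure Newton step
    print(k+1, w, g(w))          # w collapses onto the minimizer w = 0 rapidly
```

A fixed steplength gradient descent run on the same function would typically require many more iterations to reach comparable accuracy.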

Scaling limitations of Newton's method

A Newton's method step requires far more in terms of storage and computation than a first order step - requiring the storage and computation of not just a gradient but an entire $N\times N$ Hessian matrix of second derivative information as well. Simply storing the Hessian for a single step of Newton's method, with its $N^2$ entries, can quickly become challenging for even moderately sized input. For example, if the input to a function has dimension $N = 10,000$ the corresponding $10,000 \times 10,000$ Hessian matrix has $10,000^2 = 100,000,000$ entries. The kinds of functions used in machine learning applications can easily have tens of thousands to hundreds of thousands or even hundreds of millions of inputs, making the complete storage of an associated Hessian impossible.
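The sketch below is a back-of-the-envelope tally of this storage cost (our own calculation, assuming standard 8-byte floating point entries) for the dimensions mentioned above.

```python
# back-of-the-envelope Hessian storage cost, assuming 8-byte (float64) entries
N = 10_000
print(N**2)                          # 100,000,000 entries
print(N**2 * 8 / 10**9, "GB")        # ~0.8 GB to store a single Hessian

# for a function with hundreds of millions of inputs the cost is hopeless
N_big = 100_000_000
print(N_big**2 * 8 / 10**15, "PB")   # ~80 petabytes - far beyond any machine's memory
```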