
Are polynomial features the root of all evil? (2024) by Areibman
When fitting a non-linear model using linear regression, we typically generate new features using non-linear functions. We also know that, in theory, any continuous function can be approximated by a sufficiently high degree polynomial. This result is known as the Weierstrass approximation theorem. But many blogs, papers, and even books tell us that high degree polynomials should be avoided. They tend to oscillate and overfit, and regularization doesn't help! They even scare us with images, such as the one below, where the polynomial fit to the data points (in red) is far away from the true function (in blue):
It turns out that this is just a MYTH. There's nothing inherently wrong with high degree polynomials, and in contrast to what is typically taught, high degree polynomials are easily controlled using standard ML tools, like regularization. The source of the myth stems mainly from two misconceptions about polynomials that we will explore here. In fact, not only are they great non-linear features, but certain representations also provide us with powerful control over the shape of the function we wish to learn.
A Colab notebook with the code for reproducing these results is available here.
Vladimir Vapnik, in his famous book “The Nature of Statistical Learning Theory” which is cited more than 100,000 times as of today, coined the approximation vs. estimation balance. The approximation power of a model is its ability to represent the “reality” we would like to learn. Typically, approximation power increases with the complexity of the model – more parameters mean more power to represent any function to arbitrary precision. Polynomials are no different – higher degree polynomials can represent functions to higher accuracy. However, more parameters make it difficult to estimate these parameters from the data.
Indeed, higher degree polynomials have a higher capacity to approximate arbitrary functions. And since they have more coefficients, these coefficients are harder to estimate from data. But how does it differ from other non-linear features, such as the well-known radial basis functions? Why do polynomials have such a bad reputation? Are they truly hard to estimate from data?
It turns out that the primary source is the standard polynomial basis for degree-\(n\) polynomials, \(\mathbb{B}_n = \{1, x, x^2, \dots, x^n\}\). Indeed, any degree-\(n\) polynomial can be written as a linear combination of these functions:
\[\alpha_0 \cdot 1 + \alpha_1 \cdot x + \alpha_2 \cdot x^2 + \cdots + \alpha_n \cdot x^n\]
But the standard basis \(\mathbb{B}_n\) is awful for estimating polynomials from data. In this post we will explore other ways to represent polynomials that are appropriate for machine learning and are readily available in standard Python packages. We note that one advantage of polynomials over other non-linear feature bases is that their only hyperparameter is the degree. There is no "kernel width" to tune, as in radial basis functions.
The second source of their bad reputation is a misunderstanding of Weierstrass' approximation theorem. It is usually cited as "polynomials can approximate arbitrary continuous functions". But that's not entirely true: they can approximate arbitrary continuous functions on an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. This can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid, but we should avoid using polynomials on raw, un-normalized features.
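For instance, here is a minimal sketch of the min-max option using scikit-learn's MinMaxScaler (my illustration, not code from the post; the raw values are made up):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# raw feature values on an arbitrary scale (made-up numbers for illustration)
x_raw = np.array([[3.2], [17.5], [8.1], [42.0]])

# learn the training range and map it to [0, 1] before building polynomial features
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_raw)
print(x_scaled.ravel())   # every value now lies in [0, 1]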
In this post we will demonstrate fitting the function
\[f(x) = \frac{\sin(8\pi x)}{e^x} + x\]
on the interval \([0, 1]\) by fitting to \(m=30\) samples corrupted by Gaussian noise. The following code implements the function and generates samples:
import numpy as np

def true_func(x):
    return np.sin(8 * np.pi * x) / np.exp(x) + x

m = 30        # number of samples
sigma = 0.1   # noise standard deviation

# generate noisy samples of the true function
np.random.seed(42)
X = np.random.rand(m)
y = true_func(X) + sigma * np.random.randn(m)
For function plotting, we will use uniformly-spaced points in \([0, 1]\). The following code plots the true function and the sample points:
import matplotlib.pyplot as plt
plt_xs = np.linspace(0, 1, 1000)
plt.scatter(X.ravel(), y.ravel())
plt.plot(plt_xs, true_func(plt_xs), 'blue')
plt.show()
Now let's fit a polynomial to the sampled points using the standard basis. Namely, we're given the set of noisy points \(\{(x_i, y_i)\}_{i=1}^m\), and we need to find the coefficients \(\alpha_0, \dots, \alpha_n\) that minimize:
\[\sum_{i=1}^m (\alpha_0 + \alpha_1 x_i + \dots + \alpha_n x_i^n - y_i)^2\]
This is readily accomplished by transforming each sample \(x_i\) into a vector of features \((1, x_i, \dots, x_i^n)\) and fitting a linear regression model to the resulting features. Fortunately, NumPy has the numpy.polynomial.polynomial.polyvander function. It takes a vector containing \(x_1, \dots, x_m\) and produces the matrix
\[\begin{pmatrix}
1 & x_1 & x_1^2 & \dots & x_1^n \\
1 & x_2 & x_2^2 & \dots & x_2^n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_m & x_m^2 & \dots & x_m^n
\end{pmatrix}\]
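For concreteness, here is a tiny illustration (not from the original post) of what polyvander returns for three points and degree 3:
import numpy as np
import numpy.polynomial.polynomial as poly

# each row is a sample, each column is a power: 1, x, x^2, x^3
print(poly.polyvander(np.array([0.0, 0.5, 1.0]), deg=3))
# rows: [1, 0, 0, 0], [1, 0.5, 0.25, 0.125], [1, 1, 1, 1]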
The name of the function comes from the name of the matrix it produces: the Vandermonde matrix. Let's use it to fit a polynomial of degree \(n=50\).
from sklearn.linear_model import LinearRegression
import numpy.polynomial.polynomial as poly
n = 50
model = LinearRegression(fit_intercept=False)
model.fit(poly.polyvander(X, deg=n), y)
We use fit_intercept=False because the 'intercept' is already provided by the first (constant) column of the Vandermonde matrix. Now we can plot the function we just fit:
plt.scatter(X.ravel(), y.ravel()) # plot the samples
plt.plot(plt_xs, true_func(plt_xs), 'blue') # plot the true function
plt.plot(plt_xs, model.predict(poly.polyvander(plt_xs, deg=n)), 'r') # plot the fit model
plt.ylim([-5, 5])
plt.show()
As expected, we got the "scary" image from the beginning of this post. Indeed, the standard basis is awful for model fitting! We might hope that regularization provides a remedy, but it does not. Let's see for ourselves by using the Ridge class from the sklearn.linear_model package to fit an L2-regularized model:
from sklearn.linear_model import Ridge
reg_coef = 1e-7
model = Ridge(fit_intercept=False, alpha=reg_coef)
model.fit(poly.polyvander(X, deg=n), y)
plt.scatter(X.ravel(), y.ravel()) # plot the samples
plt.plot(plt_xs, true_func(plt_xs), 'blue') # plot the true function
plt.plot(plt_xs, model.predict(poly.polyvander(plt_xs, deg=n)), 'r') # plot the fit model
plt.ylim([-5, 5])
plt.show()
14 Comments
creata
A well-known related paper that I didn't see mentioned in the article (although Trefethen was mentioned) is "Six Myths of Polynomial Interpolation and Quadrature".
https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf
ComplexSystems
Great article! Very curious how this orthogonalization + regularization idea could be extended to other kinds of series, such as Fourier series.
SkyBelow
One thought I had while looking into regression recently is to consider the model created with a given regularization coefficient not as a line on a 2-dimensional graph, but as a slice of a surface on a 3-dimensional graph where the third dimension is the regularization coefficient.
In my case the model was a logistic regression and the lines were the boundary lines of the classification, but the thought is largely the same. Viewing it as a 3D shape formed by boundary lines, the hilltops are areas where entire classification boundaries disappear once the regularization coefficient grows large enough to eliminate them. Impractical to do on models of any size and only useful when looking at two features at a time, but a fun consideration.
More on topic with the article: how well does this work when considering multiple features and the different combinations of them? Instead of sigma(n => 50) of x^n, what happens if you have sigma(n => 50) of sigma(m => 50) of (x^n)*(y^m)? Well, probably less than 50 in the second example; maybe it is fair to have n and m go to 7 so there are 49 total terms compared to the original 50, instead of 2500 terms if they both go to 50.
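A rough sketch of what that tensor-product construction looks like in code (my illustration of the idea above, not something from the article), using NumPy's polyvander2d with both degrees capped at 7 as suggested:
import numpy as np
import numpy.polynomial.polynomial as poly

rng = np.random.default_rng(0)
x, y = rng.random(30), rng.random(30)      # two features, both scaled to [0, 1]

# all products x^i * y^j with i <= 7 and j <= 7, i.e. (7+1) * (7+1) = 64 columns
features = poly.polyvander2d(x, y, deg=[7, 7])
print(features.shape)                      # (30, 64)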
FabHK
This article is much better, more informative, and factual than I'd have expected from the title. Note that it's part of an 8-article series.
Worth a read if you're ever fitting functions.
xg15
Great article and clever use of linkbait!
PaulHoule
See also https://en.wikipedia.org/wiki/Chebyshev_polynomials
They're the kind of math which is non-obvious and a bit intricate but yet if you knew the basics of how they worked and you were bored you might sit down with pencil and paper and derive everything about them. Then you wouldn't be bored anymore.
tc4v
good article, but I am very bothered by the "standard basis"… it's called canonical in math. I don't think standard is the right name in any context.
petters
This part about double descent is really good: https://alexshtf.github.io/2025/03/27/Free-Poly.html#fnref:2
ForceBru
In another post (https://alexshtf.github.io/2025/03/27/Free-Poly.html) the author fits a degree-10000 (ten thousand!) polynomial using the Legendre basis. The polynomial _doesn't overfit_, demonstrating double descent. "What happened to our overfitting from ML 101 textbooks? There is no regularization. No control of the degree. But “magically” our high degree polynomial is not that bad!"
So… are _all_ introductions to machine learning just extremely wrong here? I feel like I've seen tens of reputable books and courses that introduce overfitting and generalization using severe overfitting and terrible generalization of high-degree polynomials in the usual basis (1,x,x^2,…). Seemingly everyone warns of the dangers of high-degree polynomials, yet here the author just says "use another basis" and proves everyone wrong? Mind blown, or is there a catch?
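For anyone who wants to poke at the "use another basis" claim without reading the linked post, here is a minimal sketch of the recipe (my own illustration at a more modest degree, reusing the article's setup): build the design matrix from Legendre polynomials with NumPy's legvander after mapping the data to [-1, 1], then fit as usual. Whether and why the resulting high-degree fit behaves well is exactly what the linked post discusses.
import numpy as np
from numpy.polynomial import legendre as leg
from sklearn.linear_model import LinearRegression

# same data-generating process as in the article
np.random.seed(42)
X = np.random.rand(30)
y = np.sin(8 * np.pi * X) / np.exp(X) + X + 0.1 * np.random.randn(30)

n = 50
Z = leg.legvander(2 * X - 1, deg=n)                  # Legendre features on [-1, 1]
model = LinearRegression(fit_intercept=False).fit(Z, y)

plt_xs = np.linspace(0, 1, 1000)
pred = model.predict(leg.legvander(2 * plt_xs - 1, deg=n))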
programjames
This is why I believe a numerical methods course should be a requirement for any AI major.
fancyfredbot
Wanted to add my voice to the chorus of appreciation for this article (actually a series of 8). Very informative and engaging.
ziofill
In this case wouldn't a Fourier-type approach work better? At least there's no risk the function blows up and it possibly needs fewer parameters?
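For the curious, a rough sketch of what such a Fourier-feature fit could look like on the article's data (my illustration, not the commenter's code; the number of harmonics K and the small ridge penalty are arbitrary choices):
import numpy as np
from sklearn.linear_model import Ridge

# same data-generating process as in the article
np.random.seed(42)
X = np.random.rand(30)
y = np.sin(8 * np.pi * X) / np.exp(X) + X + 0.1 * np.random.randn(30)

def fourier_features(x, K):
    # columns: 1, cos(2*pi*k*x), sin(2*pi*k*x) for k = 1..K
    cols = [np.ones_like(x)]
    for k in range(1, K + 1):
        cols += [np.cos(2 * np.pi * k * x), np.sin(2 * np.pi * k * x)]
    return np.column_stack(cols)

K = 20
model = Ridge(alpha=1e-4, fit_intercept=False)
model.fit(fourier_features(X, K), y)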
constantcrying
I completely disagree with the conclusion of the article. The reason the examples worked so well is an arbitrary choice, which went completely uncommented.
The interval was chosen as 0 to 1. This single fact was what made this feasible. Had the interval been chosen as 0 to 10, a degree-100 polynomial would have to compute 10^100, which would have led to drastic numerical errors.
The article totally fails to give any of the totally legitimate and very important reasons why high degree polynomials are dangerous. It is absurd to say that well-known numerical problems do not exist because you just found one example where they did not occur.
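The scaling issue is easy to see numerically. A quick sketch (my own, not from the article or the comment) comparing the conditioning of the Vandermonde matrix for the same points on [0, 1] and on [0, 10]:
import numpy as np
import numpy.polynomial.polynomial as poly

np.random.seed(0)
x01 = np.random.rand(30)        # samples in [0, 1]
x10 = 10 * x01                  # the same samples stretched to [0, 10]

for x in (x01, x10):
    V = poly.polyvander(x, deg=20)
    print(np.linalg.cond(V))    # the condition number explodes on the wider interval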
constantcrying
It should also be noted that this kind of fitting is very closely related to integration, or, in other words, to calculating the mean.
By using the "right" kind of polynomial basis you can get a polynomial approximation which also tells you the mean and variance of the function under a random variable.
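A sketch of what that looks like concretely (my illustration of the claim, not the commenter's code): for a Legendre expansion on [-1, 1], orthogonality gives the mean of the approximation under a uniform random variable as the degree-0 coefficient, and its variance as a weighted sum of squares of the remaining coefficients.
import numpy as np
from numpy.polynomial import legendre as leg

def g(t):
    return np.sin(3 * t) + t ** 2               # example function on [-1, 1]

t = np.linspace(-1, 1, 2001)
c = leg.legfit(t, g(t), deg=20)                 # Legendre-series coefficients of the fit

mean_est = c[0]                                 # E[g(T)] for T ~ Uniform(-1, 1)
k = np.arange(1, len(c))
var_est = np.sum(c[1:] ** 2 / (2 * k + 1))      # Var[g(T)], since int P_k^2 dt = 2/(2k+1)

# cross-check against direct numerical integration
mean_num = np.trapz(g(t), t) / 2
var_num = np.trapz(g(t) ** 2, t) / 2 - mean_num ** 2
print(mean_est, mean_num)
print(var_est, var_num)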