
Are polynomial features the root of all evil? (2024) by Areibman
When fitting a non-linear model using linear regression, we typically generate new features using non-linear functions. We also know that, in theory, any continuous function can be approximated by a sufficiently high degree polynomial. This result is known as the Weierstrass approximation theorem. But many blogs, papers, and even books tell us that high degree polynomials should be avoided. They tend to oscillate and overfit, and regularization doesn't help! They even scare us with images, such as the one below, where the polynomial fit to the data points (in red) is far away from the true function (in blue):
It turns out that this is just a MYTH. There's nothing inherently wrong with high degree polynomials, and in contrast to what is typically taught, high degree polynomials are easily controlled using standard ML tools, like regularization. The source of the myth stems mainly from two misconceptions about polynomials that we will explore here. In fact, not only are they great non-linear features, but certain representations also provide us with powerful control over the shape of the function we wish to learn.
A Colab notebook with the code for reproducing these results is available here.
Vladimir Vapnik, in his famous book “The Nature of Statistical Learning Theory” which is cited more than 100,000 times as of today, coined the approximation vs. estimation balance. The approximation power of a model is its ability to represent the “reality” we would like to learn. Typically, approximation power increases with the complexity of the model – more parameters mean more power to represent any function to arbitrary precision. Polynomials are no different – higher degree polynomials can represent functions to higher accuracy. However, more parameters make it difficult to estimate these parameters from the data.
Indeed, higher degree polynomials have a higher capacity to approximate arbitrary functions. And since they have more coefficients, these coefficients are harder to estimate from data. But how does it differ from other non-linear features, such as the well-known radial basis functions? Why do polynomials have such a bad reputation? Are they truly hard to estimate from data?
It turns out that the primary source is the standard polynomial basis for degree-\(n\) polynomials, \(\mathbb{B}_n = \{1, x, x^2, \dots, x^n\}\). Indeed, any degree-\(n\) polynomial can be written as a linear combination of these functions:
\[\alpha_0 \cdot 1 + \alpha_1 \cdot x + \alpha_2 \cdot x^2 + \cdots + \alpha_n \cdot x^n\]
But the standard basis \(\mathbb{B}_n\) is awful for estimating polynomials from data. In this post we will explore other ways to represent polynomials that are appropriate for machine learning and are readily available in standard Python packages. We note that one advantage of polynomials over other non-linear feature bases is that their only hyperparameter is the degree. There is no "kernel width" to tune, as in radial basis functions.
The second source of their bad reputation is a misunderstanding of Weierstrass' approximation theorem. It is usually cited as "polynomials can approximate arbitrary continuous functions". But that's not entirely true: they can approximate arbitrary continuous functions on an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. This can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid, but we should avoid using polynomials on raw, un-normalized features.
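For instance, here is a minimal sketch of the min-max option using scikit-learn's MinMaxScaler (my illustration, not code from the post; the raw values are made up):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# raw feature values on an arbitrary scale (made-up numbers for illustration)
x_raw = np.array([[3.2], [17.5], [8.1], [42.0]])

# learn the training range and map it to [0, 1] before building polynomial features
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_raw)
print(x_scaled.ravel())   # every value now lies in [0, 1]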
In this post we will demonstrate fitting the function
\[f(x) = \frac{\sin(8\pi x)}{e^x} + x\]
on the interval \([0, 1]\) by fitting to \(m=30\) samples corrupted by Gaussian noise. The following code implements the function and generates samples:
import numpy as np

def true_func(x):
    return np.sin(8 * np.pi * x) / np.exp(x) + x

m = 30        # number of samples
sigma = 0.1   # noise standard deviation

# generate noisy samples of the true function
np.random.seed(42)
X = np.random.rand(m)
y = true_func(X) + sigma * np.random.randn(m)
For function plotting, we will use uniformly-spaced points in \([0, 1]\). The following code plots the true function and the sample points:
import matplotlib.pyplot as plt
plt_xs = np.linspace(0, 1, 1000)
plt.scatter(X.ravel(), y.ravel())
plt.plot(plt_xs, true_func(plt_xs), 'blue')
plt.show()
Now let's fit a polynomial to the sampled points using the standard basis. Namely, we're given the set of noisy points \(\{(x_i, y_i)\}_{i=1}^m\), and we need to find the coefficients \(\alpha_0, \dots, \alpha_n\) that minimize:
\[\sum_{i=1}^m (\alpha_0 + \alpha_1 x_i + \dots + \alpha_n x_i^n - y_i)^2\]
This is readily accomplished by transforming each sample \(x_i\) into a vector of features \((1, x_i, \dots, x_i^n)\) and fitting a linear regression model to the resulting features. Fortunately, NumPy has the numpy.polynomial.polynomial.polyvander function. It takes a vector containing \(x_1, \dots, x_m\) and produces the matrix
\[\begin{pmatrix}
1 & x_1 & x_1^2 & \dots & x_1^n \\
1 & x_2 & x_2^2 & \dots & x_2^n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_m & x_m^2 & \dots & x_m^n
\end{pmatrix}\]
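For concreteness, here is a tiny illustration (not from the original post) of what polyvander returns for three points and degree 3:
import numpy as np
import numpy.polynomial.polynomial as poly

# each row is a sample, each column is a power: 1, x, x^2, x^3
print(poly.polyvander(np.array([0.0, 0.5, 1.0]), deg=3))
# rows: [1, 0, 0, 0], [1, 0.5, 0.25, 0.125], [1, 1, 1, 1]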
The name of the function comes from the name of the matrix it produces: the Vandermonde matrix. Let's use it to fit a polynomial of degree \(n=50\).
from sklearn.linear_model import LinearRegression
import numpy.polynomial.polynomial as poly
n = 50
model = LinearRegression(fit_intercept=False)
model.fit(poly.polyvander(X, deg=n), y)
We use fit_intercept=False because the 'intercept' is already provided by the first (constant) column of the Vandermonde matrix. Now we can plot the function we just fit:
plt.scatter(X.ravel(), y.ravel()) # plot the samples
plt.plot(plt_xs, true_func(plt_xs), 'blue') # plot the true function
plt.plot(plt_xs, model.predict(poly.polyvander(plt_xs, deg=n)), 'r') # plot the fit model
plt.ylim([-5, 5])
plt.show()
As expected, we got the "scary" image from the beginning of this post. Indeed, the standard basis is awful for model fitting! We might hope that regularization provides a remedy, but it does not. Let's see for ourselves by using the Ridge class from the sklearn.linear_model package to fit an L2-regularized model:
from sklearn.linear_model import Ridge
reg_coef = 1e-7
model = Ridge(fit_intercept=False, alpha=reg_coef)
model.fit(poly.polyvander(X, deg=n), y)
plt.scatter(X.ravel(), y.ravel()) # plot the samples
plt.plot(plt_xs, true_func(plt_xs), 'blue') # plot the true function
plt.plot(plt_xs, model.predict(poly.polyvander(plt_xs, deg=n)), 'r') # plot the fit model
plt.ylim([-5, 5])
plt.show()
14 Comments
creata
A well-known related paper that I didn't see mentioned in the article (although Trefethen was mentioned) is "Six Myths of Polynomial Interpolation and Quadrature".
https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf
ComplexSystems
Great article! Very curious how this orthogonalization + regularization idea could be extended to other kinds of series, such as Fourier series.
SkyBelow
One thought I had while looking into regression recently is to consider the model created with a given regularization coefficient not as a line on a 2-dimensional graph, but as a slice of a surface on a 3-dimensional graph where the third dimension is the regularization coefficient.
In my case the model was a logistic regression and the lines were the boundary lines of the classification, but the thought is largely the same. Viewing it as a 3D shape formed by boundary lines, the hilltops are areas where entire classification boundaries disappear once the regularization coefficient grows large enough to eliminate them. Impractical to do on models of any size and only useful when looking at two features at a time, but a fun consideration.
More on topic with the article: how well does this work when considering multiple features and the different combinations of them? Instead of sigma(n => 50) of x^n, what happens if you have sigma(n => 50) of sigma(m => 50) of (x^n)*(y^m)? Well, probably less than 50 in the second example; maybe it is fair to have n and m go to 7 so there are 49 total terms compared to the original 50, instead of 2500 terms if they both go to 50.
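A rough sketch of what that tensor-product construction looks like in code (my illustration of the idea above, not something from the article), using NumPy's polyvander2d with both degrees capped at 7 as suggested:
import numpy as np
import numpy.polynomial.polynomial as poly

rng = np.random.default_rng(0)
x, y = rng.random(30), rng.random(30)      # two features, both scaled to [0, 1]

# all products x^i * y^j with i <= 7 and j <= 7, i.e. (7+1) * (7+1) = 64 columns
features = poly.polyvander2d(x, y, deg=[7, 7])
print(features.shape)                      # (30, 64)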
FabHK
This article is much better, more informative, and factual than I'd have expected from the title. Note that it's part of an 8-article series.
Worth a read if you're ever fitting functions.
xg15
Great article and clever use of linkbait!
PaulHoule
See also https://en.wikipedia.org/wiki/Chebyshev_polynomials
They're the kind of math which is non-obvious and a bit intricate but yet if you knew the basics of how they worked and you were bored you might sit down with pencil and paper and derive everything about them. Then you wouldn't be bored anymore.
tc4v
good article, but I am very bothered by the "standard basis"… it's called canonical in math. I don't think standard is the right name in any context.
petters
This part about double descent is really good: https://alexshtf.github.io/2025/03/27/Free-Poly.html#fnref:2
ForceBru
In another post (https://alexshtf.github.io/2025/03/27/Free-Poly.html) the author fits a degree-10000 (ten thousand!) polynomial using the Legendre basis. The polynomial _doesn't overfit_, demonstrating double descent. "What happened to our overfitting from ML 101 textbooks? There is no regularization. No control of the degree. But “magically” our high degree polynomial is not that bad!"
So… are _all_ introductions to machine learning just extremely wrong here? I feel like I've seen tens of reputable books and courses that introduce overfitting and generalization using severe overfitting and terrible generalization of high-degree polynomials in the usual basis (1,x,x^2,…). Seemingly everyone warns of the dangers of high-degree polynomials, yet here the author just says "use another basis" and proves everyone wrong? Mind blown, or is there a catch?
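For anyone who wants to poke at the "use another basis" claim without reading the linked post, here is a minimal sketch of the recipe (my own illustration at a more modest degree, reusing the article's setup): build the design matrix from Legendre polynomials with NumPy's legvander after mapping the data to [-1, 1], then fit as usual. Whether and why the resulting high-degree fit behaves well is exactly what the linked post discusses.
import numpy as np
from numpy.polynomial import legendre as leg
from sklearn.linear_model import LinearRegression

# same data-generating process as in the article
np.random.seed(42)
X = np.random.rand(30)
y = np.sin(8 * np.pi * X) / np.exp(X) + X + 0.1 * np.random.randn(30)

n = 50
Z = leg.legvander(2 * X - 1, deg=n)                  # Legendre features on [-1, 1]
model = LinearRegression(fit_intercept=False).fit(Z, y)

plt_xs = np.linspace(0, 1, 1000)
pred = model.predict(leg.legvander(2 * plt_xs - 1, deg=n))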
programjames
This is why I believe a numerical methods course should be a requirement for any AI major.
fancyfredbot
Wanted to add my voice to the chorus of appreciation for this article (actually a series of 8). Very informative and engaging.
ziofill
In this case wouldn't a Fourier-type approach work better? At least there's no risk the function blows up and it possibly needs fewer parameters?
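For the curious, a rough sketch of what such a Fourier-feature fit could look like on the article's data (my illustration, not the commenter's code; the number of harmonics K and the small ridge penalty are arbitrary choices):
import numpy as np
from sklearn.linear_model import Ridge

# same data-generating process as in the article
np.random.seed(42)
X = np.random.rand(30)
y = np.sin(8 * np.pi * X) / np.exp(X) + X + 0.1 * np.random.randn(30)

def fourier_features(x, K):
    # columns: 1, cos(2*pi*k*x), sin(2*pi*k*x) for k = 1..K
    cols = [np.ones_like(x)]
    for k in range(1, K + 1):
        cols += [np.cos(2 * np.pi * k * x), np.sin(2 * np.pi * k * x)]
    return np.column_stack(cols)

K = 20
model = Ridge(alpha=1e-4, fit_intercept=False)
model.fit(fourier_features(X, K), y)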
constantcrying
I completely disagree with the conclusion of the article. The reason the examples worked so well is an arbitrary choice, which went completely uncommented.
The interval was chosen as 0 to 1. This single fact was what made this feasible. Had the interval been chosen as 0 to 10, a degree-100 polynomial would have to compute 10^100, which would have led to drastic numerical errors.
The article totally fails to give any of the totally legitimate and very important reasons why high degree polynomials are dangerous. It is absurd to say that well-known numerical problems do not exist because you just found one example where they did not occur.
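The scaling issue is easy to see numerically. A quick sketch (my own, not from the article or the comment) comparing the conditioning of the Vandermonde matrix for the same points on [0, 1] and on [0, 10]:
import numpy as np
import numpy.polynomial.polynomial as poly

np.random.seed(0)
x01 = np.random.rand(30)        # samples in [0, 1]
x10 = 10 * x01                  # the same samples stretched to [0, 10]

for x in (x01, x10):
    V = poly.polyvander(x, deg=20)
    print(np.linalg.cond(V))    # the condition number explodes on the wider interval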
constantcrying
It should also be noted that this kind of fitting is very closely related to integration, or, in other words, to calculating the mean.
By using the "right" kind of polynomial basis you can get a polynomial approximation which also tells you the mean and variance of the function under a random variable.
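A sketch of what that looks like concretely (my illustration of the claim, not the commenter's code): for a Legendre expansion on [-1, 1], orthogonality gives the mean of the approximation under a uniform random variable as the degree-0 coefficient, and its variance as a weighted sum of squares of the remaining coefficients.
import numpy as np
from numpy.polynomial import legendre as leg

def g(t):
    return np.sin(3 * t) + t ** 2               # example function on [-1, 1]

t = np.linspace(-1, 1, 2001)
c = leg.legfit(t, g(t), deg=20)                 # Legendre-series coefficients of the fit

mean_est = c[0]                                 # E[g(T)] for T ~ Uniform(-1, 1)
k = np.arange(1, len(c))
var_est = np.sum(c[1:] ** 2 / (2 * k + 1))      # Var[g(T)], since int P_k^2 dt = 2/(2k+1)

# cross-check against direct numerical integration
mean_num = np.trapz(g(t), t) / 2
var_num = np.trapz(g(t) ** 2, t) / 2 - mean_num ** 2
print(mean_est, mean_num)
print(var_est, var_num)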