A simple example of Feature Selection and Model Drift
The Dow Jones Industrial Average (DJIA) was introduced by Charles Dow in 1896 and has since become one of the main references for stock market performance on the New York Stock Exchange.
In this post, we will use it to better understand the pros and cons of simple Linear Regression models, the assumptions they rely on, and how we can use them for feature selection.
The value of the Dow is what is known as a price-weighted index: the average of the prices of 30 well-known stocks. The components of the Dow change over time, making this a particularly well-suited example for our purposes.
For the sake of simplicity, we’ll focus on the values of the DJIA during 2020, as that includes the crash induced by the first set of COVID-19 lockdowns (see here for a deep data-driven look at COVID-19).
To better evaluate our model, we’ll split our dataset into Training and Testing periods. This way, we can train the model using only the data in the Training period and estimate how well it does on data it has never seen before by comparing the model’s predictions with the actual values in the Testing dataset. We arbitrarily choose the period before Aug 1st, 2020 for Training and the rest of the year for Testing:
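A date-based split like this can be done with a simple boolean mask on the index. The series below is a synthetic stand-in (in the post, the data would come from a market-data source), but the split logic is the same.

```python
import pandas as pd

# Hypothetical daily series over 2020 business days standing in for the
# DJIA data used in the post.
dates = pd.bdate_range("2020-01-01", "2020-12-31")
djia = pd.Series(range(len(dates)), index=dates, name="DJIA")

# Everything before Aug 1st, 2020 is Training; the rest of the year is Testing.
train = djia[djia.index < "2020-08-01"]
test = djia[djia.index >= "2020-08-01"]

print(len(train), len(test), len(djia))
```

Splitting by date (rather than randomly) matters for time series: a random split would leak future information into the Training set.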
Now we’re all set to train our model. As we’re interested in learning more about feature selection, we start by training a model with 35 individual stocks to see if we’re able to successfully determine which ones to eliminate.
We use the wonderful statsmodels Python package to perform an Ordinary Least Squares (OLS) Linear Regression fit. The details of how OLS works are not important for our purposes. Suffice it to say that OLS works by minimizing the squared difference between the predicted and empirical values.
statsmodels provides us with a detailed statistical analysis for each of the coefficients of our model along with many other statistical results about the quality of the resulting model.
Here we see 6 columns of results for each of the model features. In particular, we have
- coef — The estimate of this model parameter (the weight assigned to this feature)
- std err — The standard error of our estimate
From these two values we can compute the t-score of our estimate:

t = coef / std err

which we can find in the third column. This value essentially measures how small the error is relative to the estimated value: the larger the t-score, the smaller the relative error and the more confident we can be in our estimate.
To better quantify our confidence, it’s usual to compute the associated p-value. Under relatively common assumptions, we expect the t-score to follow a Student’s t-distribution, and the probability of obtaining test results at least as extreme as the value observed is simply the area under the curve beyond the observed t-score (both tails, for the two-sided test statsmodels reports as P>|t|). This area is known as the p-value, and we can find it in the fourth column of the results above.
There are some nuances to interpreting p-values but, briefly, the smaller the p-value, the stronger the evidence that the quantity we’re estimating is different from zero (if the coefficient of a given feature is indistinguishable from zero, then that feature is not relevant for our model).
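The relationship between the coefficient, its standard error, the t-score, and the two-sided p-value can be reproduced by hand. The coefficient estimate, standard error, and degrees of freedom below are hypothetical numbers chosen for illustration.

```python
from scipy import stats

# Hypothetical values of the kind found in a statsmodels summary table.
coef, std_err = 0.85, 0.30
dof = 120  # residual degrees of freedom (observations minus parameters)

t_score = coef / std_err
# Two-sided p-value: probability mass in both tails beyond |t|.
p_value = 2 * stats.t.sf(abs(t_score), df=dof)

print(round(t_score, 3), round(p_value, 4))
```

Comparing this manual calculation against the `t` and `P>|t|` columns of a real summary table is a good sanity check that you are reading the table correctly.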
Typical thresholds for the p-value are:
- p