Automatic feature selection and engineering with MARS

Jaron Arbet

9/2/24

Motivation

  • Build a flexible yet interpretable model like:

\[\begin{equation} \boldsymbol{y} = f(\boldsymbol{x}_1, ..., \boldsymbol{x}_P) + \boldsymbol{e} \end{equation}\]

Generalized linear models (GLMs), e.g. linear or logistic regression, are interpretable and can be flexible, but you need to decide:

  1. Which predictors to include?
  2. For each predictor: linear or non-linear effect?
  3. Do predictors interact? If yes, 2-way, 3-way, …?

MARS (multivariate adaptive regression splines) automatically determines all of this for you 😃

Piecewise linear functions

MARS uses simple piecewise linear functions (“splines”) that can approximate complex relationships

  • Knots: points in \(\boldsymbol{X}\) where effect on \(\boldsymbol{Y}\) (slope) changes

  • Everingham, Y. L., J. Sexton, and Jimmy White. “An introduction to multivariate adaptive regression splines for the cane industry.” Proceedings of the 2011 Conference of the Australian Society of Sugar Cane Technologists. 2011.
  • https://jekel.me/2017/Fit-a-piecewise-linear-function-to-data/
  • https://flexbooks.ck12.org/cbook/ck-12-interactive-algebra-1-for-ccss/section/1.4/primary/lesson/piecewise-linear-functions-alg-1-ccss/

Decisions for piecewise linear splines

  1. Number of knots
  2. Location of knots
  3. How the slopes change at each knot
  • MARS automatically decides all 3! 😃

Hinge function

  • Main building block of MARS used to construct the piecewise linear functions
  • For predictor \(x\) and knot at \(x=t\), hinge fn has 2 parts (Hastie et al. 2009):

  • Hinge is a pair of terms: Left \((t-x)_+\) and Right \((x-t)_+\), where \((u)_+ = \max(0, u)\)
  • If MARS selects a given hinge fn, its 2 terms enter the GLM and 2 coefficients are estimated: \(\beta_{Left}(t-x)_+ + \beta_{Right}(x-t)_+\)

Suppose knot \(t = 0.5\). Here is what the hinge \(\beta_{L}(t-x)_+ + \beta_{R}(x-t)_+\) looks like for various coefficients:
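
A minimal R sketch of one such curve (the knot and coefficient values below are arbitrary, chosen only for illustration):

hinge <- function(u) pmax(0, u);  # positive-part function: (u)+ = max(0, u)

x <- seq(0, 1, by = 0.01);
knot <- 0.5;
beta.left <- -1;   # made-up coefficient on the left hinge (t - x)+
beta.right <- 2;   # made-up coefficient on the right hinge (x - t)+

y <- beta.left * hinge(knot - x) + beta.right * hinge(x - knot);
plot(x, y, type = 'l', xlab = 'x', ylab = 'contribution to prediction');
abline(v = knot, lty = 2);  # knot location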

MARS algorithm

  1. Forward pass: build a large number of hinge terms that overfits the data
  2. Backward pass: use backward variable selection to prune the model
  3. Estimate the final coefficients with lm/glm

http://www.milbo.org/doc/earth-notes.pdf

Forward step

  • \(\boldsymbol{M}\) = set of terms in model. Start with just an intercept \(\{1\}\).
  • \(\boldsymbol{C}\) = set of candidate hinge functions to add to model. Contains hinge functions at each observed value for each predictor (\(N * P * 2\) total terms):

\[\begin{equation} \boldsymbol{C} = \big\{(X_j - t)_+, (t - X_j)_+\big\} \\ t \in \{x_{1j}, x_{2j}, ..., x_{Nj}\}; \ j = 1,2, ..., P \end{equation}\]

“At each stage we consider all products of a candidate hinge in \(\boldsymbol{C}\) with a hinge in the model \(\boldsymbol{M}\). The product that decreases the residual error the most is added into the current model.” (Hastie et al. 2009)

Thus at each step, it’s possible to add:

  • New variable
  • New knot to an existing variable in the model
  • Interaction term between 2 or more variables

Iterate… stop once the maximum number of terms is reached.
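
To make the candidate set \(\boldsymbol{C}\) concrete, here is a rough sketch for a single toy predictor (the actual earth implementation is far more efficient):

hinge <- function(u) pmax(0, u);

x <- c(0.2, 0.5, 0.9);  # toy predictor with N = 3 observed values

# Each observed value serves as a potential knot t, contributing the pair
# (x - t)+ and (t - x)+, i.e. 2 * N candidate terms for this predictor
candidates <- do.call(
    cbind,
    lapply(x, function(knot) cbind(hinge(x - knot), hinge(knot - x)))
    );
dim(candidates);  # N rows, 2 * N candidate basis columns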

Backwards step

  • Forward step purposely builds a large model that overfits
  • Backward step prunes the model to reduce overfitting:

The term whose removal causes the smallest increase in residual squared error is deleted from the model at each stage, producing an estimated best model \(f_\lambda\) of each size (number of terms) \(\lambda\) (Hastie et al. 2009)

The best model of each size (1, 2, …, \(\lambda_{max}\) terms) is identified

Best? Measured by fast generalized cross-validation (GCV) or the more accurate but slower K-fold CV

GCV provides a convenient approximation to leave-one-out cross-validation for linear models, without needing to split/resample/refit the data (Hastie et al. 2009).
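
For reference, the GCV criterion in Hastie et al. (2009) penalizes the residual sum of squares by an effective number of parameters \(M(\lambda)\) that accounts for both the coefficients and the knots selected:

\[\begin{equation} \mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} \big(y_i - \hat{f}_{\lambda}(x_i)\big)^2}{\big(1 - M(\lambda)/N\big)^2} \end{equation}\]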

Tuning MARS

Potential tuning parameters:

  1. Max degree of interactions allowed (set to 1 for none).
  2. Max # terms in the Forward step (nk)
  3. Max # of retained terms in Backward step (nprune)

Simplest tuning strategy:

  • Basically involves 0 tuning parameters, with a few caveats:
  • Set degree to a moderate value like 5 and use default nk
    • GCV is used to automatically select nprune
    • If “Reached max number of terms” then increase nk
    • If many 5-way interactions, then increase degree

More advanced tuning strategies use K-fold CV to optimize over a grid of all 3 parameters.
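
As a sketch of this approach, caret's built-in "earth" method tunes nprune and degree by K-fold CV (dat and y below are placeholder names for your data and outcome):

library(caret);

set.seed(123);
fit.cv <- train(
    y ~ .,
    data = dat,
    method = 'earth',
    tuneGrid = expand.grid(
        degree = 1:3,
        nprune = seq(2, 20, by = 2)
        ),
    trControl = trainControl(method = 'cv', number = 10)
    );
fit.cv$bestTune;  # other earth arguments (e.g. nk) can be passed through train's '...'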

Example

  • Outcome = forced expiratory volume in liters (fev)
  • N = 654 youths aged 3-19 from East Boston during the 1970s
  • Potential predictors: age, height, sex, smoking status
  • Goal: use MARS for both feature selection and engineering

Fitting MARS with earth R package

By default, earth uses GCV to select the optimal number of terms:

library(earth);
fit.gcv <- earth(
    formula = fev ~.,
    data = fev,
    degree = 5,
    keepxy = TRUE
    );
print(fit.gcv);
Selected 6 of 17 terms, and 4 of 4 predictors
Termination condition: Reached nk 21
Importance: height.inches, sexMale, age, smokeYes
Number of terms at each degree of interaction: 1 3 2
GCV 0.1487769    RSS 93.32458    GRSq 0.8024061    RSq 0.8098985
  • GRSq normalizes GCV from 0 to 1, similar to adjusted \(R^2\).
  • Selected all 4 predictors with 6 hinges to maximize GRSq
plot(fit.gcv, which = c(1));

Predictor effects

Let’s explore the predictor effects from fit.gcv:

Selected hinge functions and \(\hat{\beta}\):

                                    fev
(Intercept)                  2.74759130
h(65-height.inches)         -0.09188745
h(age-8)                     0.08327821
h(height.inches-65)*sexMale  0.24942931
h(height.inches-68)         -0.14391813
h(age-8)*smokeYes           -0.02834601
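
To see how these hinge terms combine into a prediction, here is a small sketch that plugs one hypothetical observation into the fitted equation by hand (assuming sexMale and smokeYes are 0/1 indicators for those factor levels); it should match predict(fit.gcv, newdata = ...) up to rounding:

h <- function(u) pmax(0, u);  # hinge / positive-part function

# One hypothetical youth (values chosen only for illustration)
height.inches <- 66; age <- 12; sexMale <- 1; smokeYes <- 0;

manual.pred <- 2.74759130 +
    -0.09188745 * h(65 - height.inches) +
     0.08327821 * h(age - 8) +
     0.24942931 * h(height.inches - 65) * sexMale +
    -0.14391813 * h(height.inches - 68) +
    -0.02834601 * h(age - 8) * smokeYes;
manual.pred;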

Variable importance scores:

evimp(fit.gcv);
              nsubsets   gcv    rss
height.inches        5 100.0  100.0
sexMale              4  46.1   46.5
age                  3  16.5   17.9
smokeYes             1   3.6    5.5

Partial dependence plots:

  • plotmo() can be used to plot the estimated effects
  • I prefer the pdp R package for making similar plots:
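
A sketch of both options, assuming the fit.gcv model and fev data from above (keepxy = TRUE lets plotmo recover the training data):

library(plotmo);
library(pdp);

# Plot all estimated main effects and interactions from the earth fit
plotmo(fit.gcv);

# Partial dependence of predicted fev on height, via the pdp package
pd.height <- partial(fit.gcv, pred.var = 'height.inches', train = fev);
plotPartial(pd.height);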

Extensions

  • MARS/earth can be used for continuous, count, binary or multinomial outcomes (see the sketch after this list)
  • In theory, MARS can handle missing values and time-to-event outcomes. However, I’m not aware of any R implementations that support this.
  • caret’s bagged MARS can improve prediction performance at the expense of interpretability
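
For example, a minimal sketch of a binary outcome fit via earth's glm argument (dat and y are placeholder names, with y assumed to be coded 0/1):

library(earth);

fit.bin <- earth(
    y ~ .,
    data = dat,
    degree = 2,
    glm = list(family = binomial)
    );
summary(fit.bin);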

Summary

MARS automatically handles:

  • Feature engineering: what form should each feature take? Linear or non-linear, additive or interacting?
  • Feature selection: given a high-dimensional set of initial features, which should you include?

Other benefits:

  • Fast
  • Easy to tune: simplest approach has 0 tuning parameters
  • Interpretable
  • Handles mixed numeric/categorical predictors without needing further transformation (similar to tree-methods)
  • Robust to outliers in the predictors

Downsides?

  1. In my experience, although MARS is more interpretable, it generally has lower predictive performance compared to Random Forests. Bagging and/or random variable subsets (as in RF) might improve this, but that needs further investigation.
  2. No statistical inference (p-values or confidence intervals). I have ideas for how to do this, talk with me if interested.
  3. Although MARS in theory can handle missing values or time-to-event outcomes, I’m not aware of any free software that supports this. In contrast, many packages for tree-based models support missing values and time-to-event outcomes (e.g. the randomForestSRC R package).

Questions?

References

  • “Notes on the earth package”: http://www.milbo.org/doc/earth-notes.pdf
  • Original MARS paper: Friedman, Jerome H. “Multivariate adaptive regression splines.” The Annals of Statistics 19.1 (1991): 1-67.
    • A slightly more accessible intro: Friedman, Jerome H., and Charles B. Roosen. “An introduction to multivariate adaptive regression splines.” Statistical Methods in Medical Research 4.3 (1995): 197-217.
  • IMO, the Elements of Statistical Learning chapter on MARS is the best short introduction:
Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.