Automatic feature selection and engineering with MARS

Jaron Arbet

9/2/24

Motivation

  • Build a flexible yet interpretable model like:

\[\begin{equation} \boldsymbol{y} = f(\boldsymbol{x}_1, ..., \boldsymbol{x}_P) + \boldsymbol{e} \end{equation}\]

Generalized linear models (GLMs), e.g. linear or logistic regression, are interpretable and can be flexible, but you need to decide:

  1. Which predictors to include?
  2. For each predictor: linear or non-linear effect?
  3. Do predictors interact? If yes, 2-way, 3-way, …?

MARS (multivariate adaptive regression splines) automatically determines all of this for you 😃

Piecewise linear functions

MARS uses simple piecewise linear functions (“splines”) that can approximate complex relationships

  • Knots: points in \(\boldsymbol{X}\) where effect on \(\boldsymbol{Y}\) (slope) changes

  • Everingham, Y. L., J. Sexton, and Jimmy White. “An introduction to multivariate adaptive regression splines for the cane industry.” Proceedings of the 2011 Conference of the Australian Society of Sugar Cane Technologists. 2011.
  • https://jekel.me/2017/Fit-a-piecewise-linear-function-to-data/
  • https://flexbooks.ck12.org/cbook/ck-12-interactive-algebra-1-for-ccss/section/1.4/primary/lesson/piecewise-linear-functions-alg-1-ccss/

Decisions for piecewise linear splines

  1. Number of knots
  2. Location of knots
  3. How the slopes change at each knot
  • MARS automatically decides all 3! 😃

Hinge function

  • Main building block of MARS used to construct the piecewise linear functions
  • For predictor \(x\) and knot at \(x=t\), hinge fn has 2 parts (Hastie et al. 2009):

  • Hinge is a pair of terms: Left \((t-x)_+\) and Right \((x-t)_+\), where \((u)_+ = \max(0, u)\)
  • If MARS selects a given hinge fn, its 2 terms enter the GLM and 2 coefficients are estimated: \(\beta_{Left}(t-x)_+ + \beta_{Right}(x-t)_+\)

Suppose knot \(t = 0.5\). Here is what the hinge \(\beta_{L}(t-x)_+ + \beta_{R}(x-t)_+\) looks like for various coefficients:
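
A minimal R sketch of one such curve (the knot and coefficient values below are arbitrary, chosen only for illustration):

hinge <- function(u) pmax(0, u);  # positive-part function: (u)+ = max(0, u)

x <- seq(0, 1, by = 0.01);
knot <- 0.5;
beta.left <- -1;   # made-up coefficient on the left hinge (t - x)+
beta.right <- 2;   # made-up coefficient on the right hinge (x - t)+

y <- beta.left * hinge(knot - x) + beta.right * hinge(x - knot);
plot(x, y, type = 'l', xlab = 'x', ylab = 'contribution to prediction');
abline(v = knot, lty = 2);  # knot location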

MARS algorithm

  1. Forward pass: build a large number of hinge terms that overfits the data
  2. Backward pass: use backward variable selection to prune the model
  3. Estimate the final coefficients with lm/glm

http://www.milbo.org/doc/earth-notes.pdf

Forward step

  • \(\boldsymbol{M}\) = set of terms in model. Start with just an intercept \(\{1\}\).
  • \(\boldsymbol{C}\) = set of candidate hinge functions to add to model. Contains hinge functions at each observed value for each predictor (\(N * P * 2\) total terms):

\[\begin{equation} \boldsymbol{C} = \big\{(X_j - t)_+, (t - X_j)_+\big\} \\ t \in \{x_{1j}, x_{2j}, ..., x_{Nj}\}; \ j = 1,2, ..., P \end{equation}\]

“At each stage we consider all products of a candidate hinge in \(\boldsymbol{C}\) with a hinge in the model \(\boldsymbol{M}\). The product that decreases the residual error the most is added into the current model.” (Hastie et al. 2009)

Thus at each step, it’s possible to add:

  • New variable
  • New knot to an existing variable in the model
  • Interaction term between 2 or more variables

Iterate… stop once the maximum number of terms is reached.
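
To make the candidate set \(\boldsymbol{C}\) concrete, here is a rough sketch for a single toy predictor (the actual earth implementation is far more efficient):

hinge <- function(u) pmax(0, u);

x <- c(0.2, 0.5, 0.9);  # toy predictor with N = 3 observed values

# Each observed value serves as a potential knot t, contributing the pair
# (x - t)+ and (t - x)+, i.e. 2 * N candidate terms for this predictor
candidates <- do.call(
    cbind,
    lapply(x, function(knot) cbind(hinge(x - knot), hinge(knot - x)))
    );
dim(candidates);  # N rows, 2 * N candidate basis columns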

Backwards step

  • Forward step purposely builds a large model that overfits
  • Backward step prunes the model to reduce overfitting:

The term whose removal causes the smallest increase in residual squared error is deleted from the model at each stage, producing an estimated best model \(f_\lambda\) of each size (number of terms) \(\lambda\) (Hastie et al. 2009)

The best model of each size (1, 2, …, \(\lambda_{max}\) terms) is identified

Best? Measured by fast generalized cross-validation (GCV) or the more accurate but slower K-fold CV

GCV provides a convenient approximation to leave-one-out cross-validation for linear models, without needing to split/resample/refit the data (Hastie et al. 2009).
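
For reference, the GCV criterion in Hastie et al. (2009) penalizes the residual sum of squares by an effective number of parameters \(M(\lambda)\) that accounts for both the coefficients and the knots selected:

\[\begin{equation} \mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} \big(y_i - \hat{f}_{\lambda}(x_i)\big)^2}{\big(1 - M(\lambda)/N\big)^2} \end{equation}\]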

Tuning MARS

Potential tuning parameters:

  1. Max degree of interactions allowed (set to 1 for none).
  2. Max # terms in the Forward step (nk)
  3. Max # of retained terms in Backward step (nprune)

Simplest tuning strategy:

  • Basically involves 0 tuning parameters, with a few caveats:
  • Set degree to a moderate value like 5 and use default nk
    • GCV is used to automatically select nprune
    • If “Reached max number of terms” then increase nk
    • If many 5-way interactions, then increase degree

More advanced tuning strategies use K-fold CV to optimize over a grid of all 3 parameters.
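
As a sketch of this approach, caret's built-in "earth" method tunes nprune and degree by K-fold CV (dat and y below are placeholder names for your data and outcome):

library(caret);

set.seed(123);
fit.cv <- train(
    y ~ .,
    data = dat,
    method = 'earth',
    tuneGrid = expand.grid(
        degree = 1:3,
        nprune = seq(2, 20, by = 2)
        ),
    trControl = trainControl(method = 'cv', number = 10)
    );
fit.cv$bestTune;  # other earth arguments (e.g. nk) can be passed through train's '...'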

Example

  • Outcome = forced expiratory volume in liters (fev)
  • N = 654 youths aged 3-19 from East Boston during the 1970s
  • Potential predictors: age, height, sex, smoking status
  • Goal: use MARS for both feature selection and engineering

Fitting MARS with earth R package

By default, earth uses GCV to select the optimal number of terms:

library(earth);
fit.gcv <- earth(
    formula = fev ~.,
    data = fev,
    degree = 5,
    keepxy = TRUE
    );
print(fit.gcv);
Selected 6 of 17 terms, and 4 of 4 predictors
Termination condition: Reached nk 21
Importance: height.inches, sexMale, age, smokeYes
Number of terms at each degree of interaction: 1 3 2
GCV 0.1487769    RSS 93.32458    GRSq 0.8024061    RSq 0.8098985
  • GRSq normalizes GCV from 0 to 1, similar to adjusted \(R^2\).
  • Selected all 4 predictors with 6 hinges to maximize GRSq
plot(fit.gcv, which = c(1));

Predictor effects

Let’s explore the predictor effects from fit.gcv:

Selected hinge functions and \(\hat{\beta}\):

                                    fev
(Intercept)                  2.74759130
h(65-height.inches)         -0.09188745
h(age-8)                     0.08327821
h(height.inches-65)*sexMale  0.24942931
h(height.inches-68)         -0.14391813
h(age-8)*smokeYes           -0.02834601
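
To see how these hinge terms combine into a prediction, here is a small sketch that plugs one hypothetical observation into the fitted equation by hand (assuming sexMale and smokeYes are 0/1 indicators for those factor levels); it should match predict(fit.gcv, newdata = ...) up to rounding:

h <- function(u) pmax(0, u);  # hinge / positive-part function

# One hypothetical youth (values chosen only for illustration)
height.inches <- 66; age <- 12; sexMale <- 1; smokeYes <- 0;

manual.pred <- 2.74759130 +
    -0.09188745 * h(65 - height.inches) +
     0.08327821 * h(age - 8) +
     0.24942931 * h(height.inches - 65) * sexMale +
    -0.14391813 * h(height.inches - 68) +
    -0.02834601 * h(age - 8) * smokeYes;
manual.pred;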

Variable importance scores:

evimp(fit.gcv);
              nsubsets   gcv    rss
height.inches        5 100.0  100.0
sexMale              4  46.1   46.5
age                  3  16.5   17.9
smokeYes             1   3.6    5.5

Partial dependence plots:

  • plotmo() can be used to plot the estimated effects
  • I prefer the pdp R package for making similar plots:
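
A sketch of both options, assuming the fit.gcv model and fev data from above (keepxy = TRUE lets plotmo recover the training data):

library(plotmo);
library(pdp);

# Plot all estimated main effects and interactions from the earth fit
plotmo(fit.gcv);

# Partial dependence of predicted fev on height, via the pdp package
pd.height <- partial(fit.gcv, pred.var = 'height.inches', train = fev);
plotPartial(pd.height);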

Extensions

  • MARS/earth can be used for continuous, count, binary or multinomial outcomes (see the sketch after this list)
  • In theory, MARS can handle missing values and time-to-event outcomes. However, I’m not aware of any R implementations that support this.
  • caret’s bagged MARS can improve prediction performance at the expense of interpretability
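
For example, a minimal sketch of a binary outcome fit via earth's glm argument (dat and y are placeholder names, with y assumed to be coded 0/1):

library(earth);

fit.bin <- earth(
    y ~ .,
    data = dat,
    degree = 2,
    glm = list(family = binomial)
    );
summary(fit.bin);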

Summary

MARS automatically handles:

  • Feature engineering: what form should each feature take? Linear or non-linear, additive or interacting?
  • Feature selection: given a high-dimensional set of initial features, which should you include?

Other benefits:

  • Fast
  • Easy to tune: simplest approach has 0 tuning parameters
  • Interpretable
  • Handles mixed numeric/categorical predictors without needing further transformation (similar to tree-methods)
  • Robust to outliers in the predictors

Downsides?

  1. In my experience, although MARS is more interpretable, it generally has lower predictive performance compared to Random Forests. Bagging and/or random variable subsets (as in RF) might improve this, but that needs further investigation.
  2. No statistical inference (p-values or confidence intervals). I have ideas for how to do this, talk with me if interested.
  3. Although MARS in theory can handle missing values or time-to-event outcomes, I’m not aware of any free software that supports this. In contrast, many packages for tree-based models support missing values and time-to-event outcomes (e.g. the randomForestSRC R package).

Questions?

References

  • “Notes on the earth package”: http://www.milbo.org/doc/earth-notes.pdf
  • Original MARS paper: Friedman, Jerome H. “Multivariate adaptive regression splines.” The Annals of Statistics 19.1 (1991): 1-67.
    • A slightly more accessible intro: Friedman, Jerome H., and Charles B. Roosen. “An introduction to multivariate adaptive regression splines.” Statistical Methods in Medical Research 4.3 (1995): 197-217.
  • IMO, the Elements of Statistical Learning chapter on MARS is the best short introduction:
Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.