Regression Diagnostics

Jaron Arbet

9/2/24

Overview

  • What is linear regression?
  • Assumptions
  • Diagnostics and remedies for failed assumptions

Linear regression

  • Continuous response: \(\big\{Y_i\big\}_{i=1}^N = (Y_1, \ldots, Y_N)\)
  • Predictors: \(\boldsymbol{X}_i = (X_{i1}, \ldots, X_{iP})\)

[Figure: scatterplot of a response vs. a predictor; source: https://www.mathbootcamps.com/reading-scatterplots/]

Model:

\[\begin{equation} Y_i = \beta_0 + \sum_{j=1}^P \beta_j * X_{ij} + \epsilon_i \end{equation}\]

Independent normal errors with constant variance:

\[\begin{equation} \epsilon_i \overset{\text{iid}}{\sim} \text{Normal}(0, \sigma^2) \end{equation}\]
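A minimal sketch (with hypothetical coefficients) of what this model assumes: simulate one predictor with iid Normal errors, then recover the coefficients with lm():

set.seed(1);
N <- 200;
x <- rnorm(N);
epsilon <- rnorm(N, mean = 0, sd = 1);  # iid Normal(0, sigma^2) errors, sigma = 1
y <- 1 + 2 * x + epsilon;               # beta0 = 1, beta1 = 2
coef(lm(y ~ x));                        # estimates close to (1, 2)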

Ordinary Least Squares (OLS)

  • Estimate \(\hat{\boldsymbol{\beta}}\) such that the “sum of squared residuals” \(\sum_{i=1}^N(y_i - \hat{y}_i)^2\) is minimized, where \(\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^P \hat{\beta}_j * X_{ij}\)

[Figure: OLS fitted line minimizing the squared residuals; source: https://medium.com/analytics-vidhya/ordinary-least-square-ols-method-for-linear-regression-ef8ca10aadfc]
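The minimizer has a closed form, \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{y}\); a quick sketch on simulated data, checked against lm():

set.seed(1);
N <- 100;
X <- cbind(1, rnorm(N));                    # design matrix with an intercept column
y <- X %*% c(1, 2) + rnorm(N);
beta.hat <- solve(t(X) %*% X, t(X) %*% y);  # solves the normal equations
cbind(beta.hat, coef(lm(y ~ X[, 2])));      # same estimates as lm()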

Assumptions

  1. Linearity: relationship between \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) is approximately linear

  2. Normally distributed residuals

    • OR the sample size is large (Central Limit Theorem)

    …simulation studies show that “sufficiently large” is often under 100, and even for our extremely non-Normal medical cost data it is less than 500. (Lumley et al. 2002)

  3. Homoscedasticity (equal variance): the residuals have equal variance at every value of \(\boldsymbol{X}\)

  4. Independence: residuals are independent (not correlated)

Other issues:

  • Multicollinearity: highly correlated predictors can greatly increase Var(\(\hat{\beta}\))
  • Influential observations/outliers can bias results
  • Additivity: by default, the model assumes no interactions between predictors; interaction terms must be added manually.

Example dataset

  • Outcome = forced expiratory volume in liters (fev)
  • N = 654 youths age 3-19 from East Boston during the 1970s
GGally::ggpairs(fev, showStrips = TRUE)

Initial model with all variables

fit <- lm(formula = fev ~ ., data = fev);
sjPlot::tab_model(fit, ci.hyphen = ',&nbsp;', p.style = 'scientific', digits.p = 2);
Response: fev

Predictors         Estimates   CI             p
(Intercept)        -4.46       -4.89, -4.02   1.07e-69
age                 0.07        0.05, 0.08    1.21e-11
height inches       0.10        0.09, 0.11    4.98e-80
sex [Male]          0.16        0.09, 0.22    2.74e-06
smoke [Yes]        -0.09       -0.20, 0.03    1.41e-01

Observations: 654
R2 / R2 adjusted: 0.775 / 0.774

Linearity

  • Plot residuals (\(y - \hat{y}\)) vs. each predictor and \(\hat{y}\). Want to see horizontal band around 0 with no patterns.
  • Curvature in the plots suggests non-linear relationship

car::residualPlots(fit);

              Test stat Pr(>|Test stat|)    
age              5.0256        6.500e-07 ***
height.inches    7.6489        7.354e-14 ***
sex                                         
smoke                                       
Tukey test       8.3559        < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  • car::residualPlots also outputs a table that tests the linearity assumption for each continuous predictor: it adds \(X_j^2\) to the model and reports the p-value for the squared term (a significant p-value suggests curvature). The Tukey test does the same using \(\hat{y}^2\).
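For intuition, the curvature test is essentially a t-test of the added squared term; a rough sketch for age (not car's exact internals):

# add age^2 to the fitted model and test its coefficient
summary(lm(fev ~ . + I(age^2), data = fev))$coefficients['I(age^2)', ];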

Remedies for non-linearity

  • Data transformation: transforming the outcome and/or predictors so their distributions are closer to normal may help
  • GAM: a Generalized Additive Model automatically models linear/non-linear effects using smoothing splines (see previous lab talks: 2022-12-08 and 2018-04-27)
  • Polynomials: e.g. \(Age + Age^2 + Age^3 + ...\)
    • use stats::poly() for orthogonal (uncorrelated) polynomials; raw polynomial terms are usually highly correlated (quick check after the code below)
fit.poly <- lm(fev ~ poly(age, 2) + poly(height.inches, 2) + sex + smoke, data = fev);
car::residualPlots(fit.poly, tests = FALSE);
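A quick check of why stats::poly() is preferred over raw polynomial terms:

cor(fev$age, fev$age^2);      # raw terms: correlation near 1
cor(poly(fev$age, 2))[1, 2];  # orthogonal terms: correlation near 0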

  • Note the adjusted \(R^2\) was 0.77 and 0.79 for the linear and polynomial models, respectively

Normality of residuals or large N

  • Simulations and refs from (Lumley et al. 2002) suggest our N = 654 is sufficient. Nevertheless, you can visually check normality below.
  • P-value tests of Normality are not recommended: low power in the scenario you care about (small N) but usually significant in the scenario you don’t care about (large N)
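A minimal visual check of the residuals, e.g.:

par(mfrow = c(1, 2));
hist(resid(fit), main = 'Residuals', xlab = 'Residual');
car::qqPlot(fit, main = 'Normal QQ plot');  # want points near the line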

Remedies for non-normal residuals

  • Large \(N\)
  • Data transformation of outcome and/or predictors
  • Make sure linearity assumption is met
  • Bootstrap confidence intervals for robust inference:
set.seed(123);
bootstrap <- car::Boot(fit, method = 'case');
round(confint(bootstrap), 3);
Bootstrap bca confidence intervals

               2.5 % 97.5 %
(Intercept)   -4.936 -3.940
age            0.046  0.087
height.inches  0.093  0.114
sexMale        0.103  0.231
smokeYes      -0.249  0.068

Homoscedasticity (equal variance)

  • Plot residuals vs fitted values: want flat/horizontal band around 0
  • Funnel pattern like < or > indicates heteroscedasticity
par(mfrow = c(1,2))
car::residualPlot(fit, main = 'Linear fit');
car::residualPlot(fit.poly, main = 'Polynomial fit');

car::spreadLevelPlot: \(\log(|\text{studentized residuals}|)\) vs. \(\log(\hat{y})\)

  • Flat horizontal line means equal variance

car::spreadLevelPlot(fit);

Suggested power transformation:  0.3772182 

car::spreadLevelPlot(fit.poly);

Suggested power transformation:  -0.09245971 
car::ncvTest(fit);
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 93.4299, Df = 1, p = < 2.22e-16
car::ncvTest(fit.poly);
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 127.954, Df = 1, p = < 2.22e-16

Remedies for unequal variance

  • Transform \(\boldsymbol{Y}\): car::spreadLevelPlot() prints a “Suggested power transformation” \(\tau\). Refit model with \(\boldsymbol{Y}^\tau\).
  • Unequal variance affects \(Var({\hat{\beta}})\) but not \(\hat{\beta}\). Thus, you can use robust standard errors or the bootstrap for CIs/p-values:
# robust standard errors
robust.se <- lmtest::coeftest(
    x = fit, 
    vcov = sandwich::vcovHC(fit)
    );
confint(robust.se);
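For the transformation route, a sketch using the suggested exponent from spreadLevelPlot above (tau ≈ 0.38):

# refit with the transformed outcome, then re-check for non-constant variance
fit.transformed <- lm(fev^0.3772182 ~ age + height.inches + sex + smoke, data = fev);
car::ncvTest(fit.transformed);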

Independent residuals

  • Correlated residuals can occur from repeated measures, or when patients cluster by some group (e.g. family, hospital)
  • Plot residuals vs time or other suspected clustering variables
  • Remedies: robust sandwich standard errors to account for the cluster effect (sketch below), or linear mixed models
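A sketch of cluster-robust standard errors, assuming a hypothetical grouping variable cluster.id in the data:

lmtest::coeftest(
    x = fit,
    vcov = sandwich::vcovCL(fit, cluster = ~ cluster.id)  # cluster.id is hypothetical
    );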

Multicollinearity

  • Highly correlated predictors can increase \(Var(\hat{\beta})\), producing unreliable results

  • caret::findCorrelation removes predictors with pairwise correlation > cutoff (example at the end of this section)

  • Variance inflation factors: \(\text{VIF}(X_j) = \frac{1}{1 - R^2_j}\), where \(R^2_j\) is the proportion of variance of \(X_j\) explained by all other predictors

    • VIF > 10 is a common cutoff
car::vif(fit);
          age height.inches           sex         smoke 
     3.019010      2.829728      1.060228      1.209564 
  • Other remedies include PCA and regularized regression
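A sketch of caret::findCorrelation on the two numeric predictors here, with a hypothetical 0.8 cutoff:

numeric.predictors <- fev[, c('age', 'height.inches')];
# names of predictors to drop when pairwise |correlation| > 0.8
caret::findCorrelation(cor(numeric.predictors), cutoff = 0.8, names = TRUE);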

Influential observations

  • Extreme values in \(\boldsymbol{Y}\) and/or \(\boldsymbol{X}\) can highly influence results
  • \(|\text{Standardized residuals}| > 3\) indicates a potential outlier (if normally distributed, 99.7% should fall within \(\pm 3\))
plot(
    x = fitted(fit),
    y = rstandard(fit),
    ylab = 'Standardized Residuals',
    xlab = 'Fitted values'
    );
abline(h = -3, lty = 2);
abline(h = 3, lty = 2);

Cook’s distance

  • \(D_i = \frac{\sum_j(\hat{y}_j - \hat{y}_{j(i)})^2}{(p+1) * \hat{\sigma}^2}\)
  • \(D_i\) is proportional to the distance that the predicted values would move if the \(i\)th patient was excluded
  • Various cutoffs have been used in the literature, e.g. 1, \(\frac{4}{N}\), or \(\frac{4}{N - p - 1}\); or visually identify patients with relatively large values (see the check below)

car::influenceIndexPlot(fit, vars = 'Cook');
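For example, flagging patients above the 4/N rule of thumb:

cooks <- cooks.distance(fit);
which(cooks > 4 / length(cooks));  # row indices of flagged observations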

Remedies for influential observations

  • Sensitivity analysis: fit the model with and without the influential observations and compare results
  • Robust regression to downweight influential observations, e.g. quantreg::rq (sketch below)
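A sketch of robust (median) regression, which is less sensitive to outlying outcome values:

# median regression estimates the conditional median instead of the mean
fit.robust <- quantreg::rq(fev ~ age + height.inches + sex + smoke, tau = 0.5, data = fev);
summary(fit.robust);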

Interactions

  • Someone should do a short talk on interactions!

  • By default, linear regression assumes no interactions between predictors.

  • You can manually add interaction terms to the model to investigate. A*B in an R formula expands to A + B + A:B

# allow all variables to interact with Sex
fit.interaction <- lm(fev ~ (age + height.inches + smoke) * sex, data = fev);
sjPlot::tab_model(fit.interaction, ci.hyphen = ',&nbsp;', p.style = 'scientific', digits.p = 2);
Response: fev

Predictors                   Estimates   CI             p
(Intercept)                  -3.36       -4.07, -2.65   1.93e-19
age                           0.06        0.03, 0.08    5.05e-06
height inches                 0.09        0.07, 0.10    7.83e-30
smoke [Yes]                  -0.07       -0.22, 0.08    3.75e-01
sex [Male]                   -1.32       -2.23, -0.41   4.48e-03
age * sex [Male]              0.03       -0.01, 0.06    1.39e-01
height inches * sex [Male]    0.02        0.00, 0.04    4.07e-02
smoke [Yes] * sex [Male]      0.02       -0.22, 0.25    8.93e-01

Observations: 654
R2 / R2 adjusted: 0.786 / 0.783

  • If you suspect many interactions, it may be better to use a machine learning model that automatically handles interactions, like decision trees or Random Forest (sketch below)
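A sketch with a random forest (assuming the randomForest package is installed):

set.seed(123);
fit.rf <- randomForest::randomForest(fev ~ ., data = fev);
fit.rf$rsq[fit.rf$ntree];  # out-of-bag R^2 after the final tree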

Summary

Linearity
  • Assess: car::residualPlots; want a horizontal band around 0 for each predictor
  • Solve: transform \(Y\) or \(X\); GAM to automate linear/non-linear effects; polynomials

Normality of residuals
  • Assess: histogram/density plot; normal QQ plot
  • Solve: large N; transform \(Y\) or \(X\); make sure the linearity assumption is met; bootstrap CIs: confint(car::Boot(fit))

Equal variance
  • Assess: car::residualPlot and car::spreadLevelPlot (want a horizontal band around 0); car::ncvTest
  • Solve: transform \(Y\) using the exponent from spreadLevelPlot; robust standard errors or bootstrap CIs

Independent residuals
  • Assess: plot residuals vs. time or other suspected clustering variables
  • Solve: robust sandwich standard errors to account for the cluster effect; linear mixed models

Multicollinearity
  • Assess: check correlation between predictors; car::vif, want VIF < 10
  • Solve: of two highly correlated predictors, keep only one; caret::findCorrelation to remove predictors with correlation > cutoff; PCA; regularized regression

Influential observations
  • Assess: plot standardized residuals vs. fitted values (|r| > 3 suggests an outlier); Cook's distance, car::influenceIndexPlot
  • Solve: sensitivity analysis fitting models with/without influential observations; robust regression to downweight them: quantreg::rq

Interactions
  • Assess: manually add interaction terms and check significance
  • Solve: manually add interaction terms; stratify the model by the potential interacting variable; ML models that automatically handle interactions

Extensions to Generalized Linear Models

  • Generalized linear model (GLM): for binary, multi-category, ordinal, or count outcomes
  • car::residualPlots to assess linearity of each predictor. Want to see horizontal band around 0 with no patterns. For non-linearity, use polynomials or GAM.
  • car::vif to assess multicollinearity
  • car::influenceIndexPlot to assess influential observations
  • car::Boot for robust bootstrap confidence intervals
  • GLM also makes assumptions about the variance (e.g. Poisson regression assumes the variance equals the mean); if violated, you can use the bootstrap or robust standard errors:
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit))
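For example, a logistic model on this dataset (a sketch, assuming smoke is a two-level factor) reuses the same helpers:

fit.glm <- glm(smoke ~ age + sex, data = fev, family = binomial);
car::vif(fit.glm);
lmtest::coeftest(fit.glm, vcov = sandwich::vcovHC(fit.glm));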

References

Lumley, Thomas, Paula Diehr, Scott Emerson, and Lu Chen. 2002. “The Importance of the Normality Assumption in Large Public Health Data Sets.” Annual Review of Public Health 23 (1): 151–69.