Regression Diagnostics

Jaron Arbet

9/2/24

Overview

  • What is linear regression?
  • Assumptions
  • Diagnostics and remedies for failed assumptions

Linear regression

  • Continuous response: \(\big\{Y_i\big\}_{i=1}^N = (Y_1, \ldots, Y_N)\)
  • Predictors: \(\boldsymbol{X}_i = (X_{i1}, \ldots, X_{iP})\)

[Figure: scatterplot of a response vs. a predictor; source: https://www.mathbootcamps.com/reading-scatterplots/]

Model:

\[\begin{equation} Y_i = \beta_0 + \sum_{j=1}^P \beta_j * X_{ij} + \epsilon_i \end{equation}\]

Independent normal errors with constant variance:

\[\begin{equation} \epsilon_i \overset{\text{iid}}{\sim} \text{Normal}(0, \sigma^2) \end{equation}\]
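A minimal sketch (with hypothetical coefficients) of what this model assumes: simulate one predictor with iid Normal errors, then recover the coefficients with lm():

set.seed(1);
N <- 200;
x <- rnorm(N);
epsilon <- rnorm(N, mean = 0, sd = 1);  # iid Normal(0, sigma^2) errors, sigma = 1
y <- 1 + 2 * x + epsilon;               # beta0 = 1, beta1 = 2
coef(lm(y ~ x));                        # estimates close to (1, 2)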

Ordinary Least Squares (OLS)

  • Estimate \(\hat{\boldsymbol{\beta}}\) such that the “sum of squared residuals” \(\sum_{i=1}^N(y_i - \hat{y}_i)^2\) is minimized, where \(\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^P \hat{\beta}_j * X_{ij}\)

[Figure: OLS fitted line minimizing the squared residuals; source: https://medium.com/analytics-vidhya/ordinary-least-square-ols-method-for-linear-regression-ef8ca10aadfc]
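The minimizer has a closed form, \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{y}\); a quick sketch on simulated data, checked against lm():

set.seed(1);
N <- 100;
X <- cbind(1, rnorm(N));                    # design matrix with an intercept column
y <- X %*% c(1, 2) + rnorm(N);
beta.hat <- solve(t(X) %*% X, t(X) %*% y);  # solves the normal equations
cbind(beta.hat, coef(lm(y ~ X[, 2])));      # same estimates as lm()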

Assumptions

  1. Linearity: relationship between \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) is approximately linear

  2. Normally distributed residuals

    • OR the sample size is large (Central Limit Theorem)

    …simulation studies show that “sufficiently large” is often under 100, and even for our extremely non-Normal medical cost data it is less than 500. (Lumley et al. 2002)

  3. Homoscedasticity (equal variance): the residuals have equal variance at every value of \(\boldsymbol{X}\)

  4. Independence: residuals are independent (not correlated)

Other issues:

  • Multicollinearity: highly correlated predictors can greatly increase Var(\(\hat{\beta}\))
  • Influential observations/outliers can bias results
  • Additivity: by default, the model assumes no interactions between predictors; interaction terms must be added manually.

Example dataset

  • Outcome = forced expiratory volume in liters (fev)
  • N = 654 youths age 3-19 from East Boston during the 1970s
GGally::ggpairs(fev, showStrips = TRUE)

Initial model with all variables

fit <- lm(formula = fev ~ ., data = fev);
sjPlot::tab_model(fit, ci.hyphen = ',&nbsp;', p.style = 'scientific', digits.p = 2);
Response: fev

Predictors         Estimates   CI             p
(Intercept)        -4.46       -4.89, -4.02   1.07e-69
age                 0.07        0.05, 0.08    1.21e-11
height inches       0.10        0.09, 0.11    4.98e-80
sex [Male]          0.16        0.09, 0.22    2.74e-06
smoke [Yes]        -0.09       -0.20, 0.03    1.41e-01

Observations: 654
R2 / R2 adjusted: 0.775 / 0.774

Linearity

  • Plot residuals (\(y - \hat{y}\)) vs. each predictor and \(\hat{y}\). Want to see horizontal band around 0 with no patterns.
  • Curvature in the plots suggests non-linear relationship

car::residualPlots(fit);

              Test stat Pr(>|Test stat|)    
age              5.0256        6.500e-07 ***
height.inches    7.6489        7.354e-14 ***
sex                                         
smoke                                       
Tukey test       8.3559        < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  • car::residualPlots also outputs a table that tests the linearity assumption for each continuous predictor: it adds \(X_j^2\) to the model and reports the p-value for the squared term (a significant p-value suggests curvature). The Tukey test does the same using \(\hat{y}^2\).
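For intuition, the curvature test is essentially a t-test of the added squared term; a rough sketch for age (not car's exact internals):

# add age^2 to the fitted model and test its coefficient
summary(lm(fev ~ . + I(age^2), data = fev))$coefficients['I(age^2)', ];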

Remedies for non-linearity

  • Data transformation: transforming the outcome and/or predictors so their distributions are closer to normal may help
  • GAM: a Generalized Additive Model automatically models linear/non-linear effects using smoothing splines (see previous lab talks: 2022-12-08 and 2018-04-27)
  • Polynomials: e.g. \(Age + Age^2 + Age^3 + ...\)
    • use stats::poly() for orthogonal (uncorrelated) polynomials; raw polynomial terms are usually highly correlated (quick check after the code below)
fit.poly <- lm(fev ~ poly(age, 2) + poly(height.inches, 2) + sex + smoke, data = fev);
car::residualPlots(fit.poly, tests = FALSE);
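A quick check of why stats::poly() is preferred over raw polynomial terms:

cor(fev$age, fev$age^2);      # raw terms: correlation near 1
cor(poly(fev$age, 2))[1, 2];  # orthogonal terms: correlation near 0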

  • Note the adjusted \(R^2\) was 0.77 and 0.79 for the linear and polynomial models, respectively

Normality of residuals or large N

  • Simulations and refs from (Lumley et al. 2002) suggest our N = 654 is sufficient. Nevertheless, you can visually check normality below.
  • P-value tests of Normality are not recommended: low power in the scenario you care about (small N) but usually significant in the scenario you don’t care about (large N)
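A minimal visual check of the residuals, e.g.:

par(mfrow = c(1, 2));
hist(resid(fit), main = 'Residuals', xlab = 'Residual');
car::qqPlot(fit, main = 'Normal QQ plot');  # want points near the line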

Remedies for non-normal residuals

  • Large \(N\)
  • Data transformation of outcome and/or predictors
  • Make sure linearity assumption is met
  • Bootstrap confidence intervals for robust inference:
set.seed(123);
bootstrap <- car::Boot(fit, method = 'case');
round(confint(bootstrap), 3);
Bootstrap bca confidence intervals

               2.5 % 97.5 %
(Intercept)   -4.936 -3.940
age            0.046  0.087
height.inches  0.093  0.114
sexMale        0.103  0.231
smokeYes      -0.249  0.068

Homoscedasticity (equal variance)

  • Plot residuals vs fitted values: want flat/horizontal band around 0
  • Funnel pattern like < or > indicates heteroscedasticity
par(mfrow = c(1,2))
car::residualPlot(fit, main = 'Linear fit');
car::residualPlot(fit.poly, main = 'Polynomial fit');

car::spreadLevelPlot: \(\log(|\text{studentized residuals}|)\) vs. \(\log(\hat{y})\)

  • Flat horizontal line means equal variance

car::spreadLevelPlot(fit);

Suggested power transformation:  0.3772182 

car::spreadLevelPlot(fit.poly);

Suggested power transformation:  -0.09245971 
car::ncvTest(fit);
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 93.4299, Df = 1, p = < 2.22e-16
car::ncvTest(fit.poly);
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 127.954, Df = 1, p = < 2.22e-16

Remedies for unequal variance

  • Transform \(\boldsymbol{Y}\): car::spreadLevelPlot() prints a “Suggested power transformation” \(\tau\). Refit model with \(\boldsymbol{Y}^\tau\).
  • Unequal variance affects \(Var({\hat{\beta}})\) but not \(\hat{\beta}\). Thus, you can use robust standard errors or the bootstrap for CIs/p-values:
# robust standard errors
robust.se <- lmtest::coeftest(
    x = fit, 
    vcov = sandwich::vcovHC(fit)
    );
confint(robust.se);
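For the transformation route, a sketch using the suggested exponent from spreadLevelPlot above (tau ≈ 0.38):

# refit with the transformed outcome, then re-check for non-constant variance
fit.transformed <- lm(fev^0.3772182 ~ age + height.inches + sex + smoke, data = fev);
car::ncvTest(fit.transformed);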

Independent residuals

  • Correlated residuals can occur from repeated measures, or when patients cluster by some group (e.g. family, hospital)
  • Plot residuals vs time or other suspected clustering variables
  • Remedies: robust sandwich standard errors to account for the cluster effect (sketch below), or linear mixed models
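A sketch of cluster-robust standard errors, assuming a hypothetical grouping variable cluster.id in the data:

lmtest::coeftest(
    x = fit,
    vcov = sandwich::vcovCL(fit, cluster = ~ cluster.id)  # cluster.id is hypothetical
    );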

Multicollinearity

  • Highly correlated predictors can increase \(Var(\hat{\beta})\), producing unreliable results

  • caret::findCorrelation removes predictors with pairwise correlation > cutoff (example at the end of this section)

  • Variance inflation factors: \(\text{VIF}(X_j) = \frac{1}{1 - R^2_j}\), where \(R^2_j\) is the proportion of variance of \(X_j\) explained by all other predictors

    • VIF > 10 is a common cutoff
car::vif(fit);
          age height.inches           sex         smoke 
     3.019010      2.829728      1.060228      1.209564 
  • Other remedies include PCA and regularized regression
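A sketch of caret::findCorrelation on the two numeric predictors here, with a hypothetical 0.8 cutoff:

numeric.predictors <- fev[, c('age', 'height.inches')];
# names of predictors to drop when pairwise |correlation| > 0.8
caret::findCorrelation(cor(numeric.predictors), cutoff = 0.8, names = TRUE);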

Influential observations

  • Extreme values in \(\boldsymbol{Y}\) and/or \(\boldsymbol{X}\) can highly influence results
  • \(|\text{Standardized residuals}| > 3\) indicates a potential outlier (if normally distributed, 99.7% should fall within \(\pm 3\))
plot(
    x = fitted(fit),
    y = rstandard(fit),
    ylab = 'Standardized Residuals',
    xlab = 'Fitted values'
    );
abline(h = -3, lty = 2);
abline(h = 3, lty = 2);

Cook’s distance

  • \(D_i = \frac{\sum_j(\hat{y}_j - \hat{y}_{j(i)})^2}{(p+1) * \hat{\sigma}^2}\)
  • \(D_i\) is proportional to the distance that the predicted values would move if the \(i\)th patient was excluded
  • Various cutoffs have been used in the literature, e.g. 1, \(\frac{4}{N}\), or \(\frac{4}{N - p - 1}\); or visually identify patients with relatively large values (see the check below)

car::influenceIndexPlot(fit, vars = 'Cook');
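For example, flagging patients above the 4/N rule of thumb:

cooks <- cooks.distance(fit);
which(cooks > 4 / length(cooks));  # row indices of flagged observations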

Remedies for influential observations

  • Sensitivity analysis: fit the model with and without the influential observations and compare results
  • Robust regression to downweight influential observations, e.g. quantreg::rq (sketch below)
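A sketch of robust (median) regression, which is less sensitive to outlying outcome values:

# median regression estimates the conditional median instead of the mean
fit.robust <- quantreg::rq(fev ~ age + height.inches + sex + smoke, tau = 0.5, data = fev);
summary(fit.robust);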

Interactions

  • Someone should do a short talk on interactions!

  • By default, linear regression assumes no interactions between predictors.

  • You can manually add interaction terms to the model to investigate. A*B in an R formula expands to A + B + A:B

# allow all variables to interact with Sex
fit.interaction <- lm(fev ~ (age + height.inches + smoke) * sex, data = fev);
sjPlot::tab_model(fit.interaction, ci.hyphen = ',&nbsp;', p.style = 'scientific', digits.p = 2);
Response: fev

Predictors                   Estimates   CI             p
(Intercept)                  -3.36       -4.07, -2.65   1.93e-19
age                           0.06        0.03, 0.08    5.05e-06
height inches                 0.09        0.07, 0.10    7.83e-30
smoke [Yes]                  -0.07       -0.22, 0.08    3.75e-01
sex [Male]                   -1.32       -2.23, -0.41   4.48e-03
age * sex [Male]              0.03       -0.01, 0.06    1.39e-01
height inches * sex [Male]    0.02        0.00, 0.04    4.07e-02
smoke [Yes] * sex [Male]      0.02       -0.22, 0.25    8.93e-01

Observations: 654
R2 / R2 adjusted: 0.786 / 0.783

  • If you suspect many interactions, it may be better to use a machine learning model that automatically handles interactions, like decision trees or Random Forest (sketch below)
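A sketch with a random forest (assuming the randomForest package is installed):

set.seed(123);
fit.rf <- randomForest::randomForest(fev ~ ., data = fev);
fit.rf$rsq[fit.rf$ntree];  # out-of-bag R^2 after the final tree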

Summary

Linearity
  • Assess: car::residualPlots; want a horizontal band around 0 for each predictor
  • Solve: transform \(Y\) or \(X\); GAM to automate linear/non-linear effects; polynomials

Normality of residuals
  • Assess: histogram/density plot; normal QQ plot
  • Solve: large N; transform \(Y\) or \(X\); make sure the linearity assumption is met; bootstrap CIs: confint(car::Boot(fit))

Equal variance
  • Assess: car::residualPlot and car::spreadLevelPlot (want a horizontal band around 0); car::ncvTest
  • Solve: transform \(Y\) using the exponent from spreadLevelPlot; robust standard errors or bootstrap CIs

Independent residuals
  • Assess: plot residuals vs. time or other suspected clustering variables
  • Solve: robust sandwich standard errors to account for the cluster effect; linear mixed models

Multicollinearity
  • Assess: check correlation between predictors; car::vif, want VIF < 10
  • Solve: of two highly correlated predictors, keep only one; caret::findCorrelation to remove predictors with correlation > cutoff; PCA; regularized regression

Influential observations
  • Assess: plot standardized residuals vs. fitted values (|r| > 3 suggests an outlier); Cook's distance, car::influenceIndexPlot
  • Solve: sensitivity analysis fitting models with/without influential observations; robust regression to downweight them: quantreg::rq

Interactions
  • Assess: manually add interaction terms and check significance
  • Solve: manually add interaction terms; stratify the model by the potential interacting variable; ML models that automatically handle interactions

Extensions to Generalized Linear Models

  • Generalized linear model (GLM): for binary, multi-category, ordinal, or count outcomes
  • car::residualPlots to assess linearity of each predictor. Want to see horizontal band around 0 with no patterns. For non-linearity, use polynomials or GAM.
  • car::vif to assess multicollinearity
  • car::influenceIndexPlot to assess influential observations
  • car::Boot for robust bootstrap confidence intervals
  • GLM also makes assumptions about the variance (e.g. Poisson regression assumes the variance equals the mean); if violated, you can use the bootstrap or robust standard errors:
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit))
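For example, a logistic model on this dataset (a sketch, assuming smoke is a two-level factor) reuses the same helpers:

fit.glm <- glm(smoke ~ age + sex, data = fev, family = binomial);
car::vif(fit.glm);
lmtest::coeftest(fit.glm, vcov = sandwich::vcovHC(fit.glm));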

References

Lumley, Thomas, Paula Diehr, Scott Emerson, and Lu Chen. 2002. “The Importance of the Normality Assumption in Large Public Health Data Sets.” Annual Review of Public Health 23 (1): 151–69.