9/2/24
https://www.mathbootcamps.com/reading-scatterplots/
Model:
\[\begin{equation} Y_i = \beta_0 + \sum_{j=1}^P \beta_j X_{ij} + \epsilon_i \end{equation}\]
Independent normal errors with constant variance:
\[\begin{equation} \epsilon_i \overset{\text{iid}}{\sim} \text{Normal}(0, \sigma^2) \end{equation}\]
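As a quick sketch, the model and its error assumptions can be simulated directly (the coefficients and variable names below are made up for illustration, not from the FEV data):

```r
# Simulate Y_i = b0 + sum_j b_j * X_ij + e_i with iid Normal(0, sigma^2) errors
set.seed(1)
n     <- 200
X     <- cbind(x1 = rnorm(n), x2 = rnorm(n))
beta  <- c(2, 0.5, -1)                    # b0, b1, b2 (made up)
sigma <- 0.7
y     <- drop(beta[1] + X %*% beta[-1] + rnorm(n, sd = sigma))
fit.sim <- lm(y ~ X)
coef(fit.sim)  # estimates should be close to c(2, 0.5, -1)
```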
https://medium.com/analytics-vidhya/ordinary-least-square-ols-method-for-linear-regression-ef8ca10aadfc
- Linearity: the relationship between \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) is approximately linear
- Normally distributed residuals
  - "…simulation studies show that 'sufficiently large' is often under 100, and even for our extremely non-Normal medical cost data it is less than 500." (Lumley et al. 2002)
- Homoscedasticity (equal variance): the residuals have equal variance at every value of \(\boldsymbol{X}\)
- Independence: residuals are independent (not correlated)
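These assumptions can be eyeballed with base-R diagnostic plots; a sketch on a stand-in model, since the fev data aren't loaded here (built-in `cars` data):

```r
# Base-R diagnostics for a fitted lm object
fit.demo <- lm(dist ~ speed, data = cars)
par(mfrow = c(1, 2))
plot(fit.demo, which = 1)  # residuals vs fitted: want a flat band around 0
plot(fit.demo, which = 2)  # normal QQ plot: points should hug the line
```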
Other issues:
fit <- lm(formula = fev ~ ., data = fev)
sjPlot::tab_model(fit, ci.hyphen = ', ', p.style = 'scientific', digits.p = 2)
Outcome: fev

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | -4.46 | -4.89, -4.02 | 1.07e-69 |
| age | 0.07 | 0.05, 0.08 | 1.21e-11 |
| height inches | 0.10 | 0.09, 0.11 | 4.98e-80 |
| sex [Male] | 0.16 | 0.09, 0.22 | 2.74e-06 |
| smoke [Yes] | -0.09 | -0.20, 0.03 | 1.41e-01 |
| Observations | 654 | | |
| R² / R² adjusted | 0.775 / 0.774 | | |
- `car::residualPlots` outputs a table that tests the linearity assumption for each continuous predictor: it reports the p-value for an added \(X_j^2\) term.
- `stats::poly()` for uncorrelated (orthogonal) polynomials; regular polynomials are usually highly correlated.
- `car::spreadLevelPlot`: \(\log(|\text{studentized residuals}|)\) vs. \(\log(\hat{y})\). It prints a "Suggested power transformation" \(\tau\), e.g.:

  Suggested power transformation: 0.3772182

  Suggested power transformation: -0.09245971

  Refit the model with \(\boldsymbol{Y}^\tau\).
- Highly correlated predictors can increase \(\text{Var}(\hat{\beta})\), producing unreliable results.
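The transform-and-refit workflow looks like this sketch; `cars` stands in for the fev data, and the exponent is reused from the printed output above purely for illustration:

```r
# Suppose car::spreadLevelPlot(fit) printed
#   "Suggested power transformation: 0.3772182"
fit0    <- lm(dist ~ speed, data = cars)  # stand-in model on built-in data
tau     <- 0.3772182
fit.tau <- lm(dist^tau ~ speed, data = cars)  # refit with Y^tau
# eyeball whether the residual spread is flatter after transforming
plot(fitted(fit.tau), abs(rstudent(fit.tau)),
     xlab = "fitted", ylab = "|studentized residual|")
```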
- `caret::findCorrelation` removes predictors with correlation > cutoff.
- Variance inflation factors: \(\text{VIF}(X_j) = \frac{1}{1 - R^2_j}\), where \(R^2_j\) is the proportion of variance of \(X_j\) explained by all other predictors.
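The VIF formula can be verified by hand in base R (made-up correlated predictors; `car::vif` on the full model gives the same numbers):

```r
# VIF(X_j) = 1 / (1 - R^2_j), with R^2_j from regressing X_j on the others
set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # strongly correlated with x1
x3 <- rnorm(n)
r2.1   <- summary(lm(x1 ~ x2 + x3))$r.squared
vif.x1 <- 1 / (1 - r2.1)
vif.x1  # large, because x1 and x2 are highly correlated
```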
- `quantreg::rq()` for median regression.

Someone should do a short talk on interactions!
By default, linear regression assumes no interactions between predictors.
You can manually add interaction terms to the model to investigate: `A*B` in an R formula expands to `A + B + A:B`.
# allow all variables to interact with sex
fit.interaction <- lm(fev ~ (age + height.inches + smoke) * sex, data = fev)
sjPlot::tab_model(fit.interaction, ci.hyphen = ', ', p.style = 'scientific', digits.p = 2)
Outcome: fev

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | -3.36 | -4.07, -2.65 | 1.93e-19 |
| age | 0.06 | 0.03, 0.08 | 5.05e-06 |
| height inches | 0.09 | 0.07, 0.10 | 7.83e-30 |
| smoke [Yes] | -0.07 | -0.22, 0.08 | 3.75e-01 |
| sex [Male] | -1.32 | -2.23, -0.41 | 4.48e-03 |
| age * sex [Male] | 0.03 | -0.01, 0.06 | 1.39e-01 |
| height inches * sex [Male] | 0.02 | 0.00, 0.04 | 4.07e-02 |
| smoke [Yes] * sex [Male] | 0.02 | -0.22, 0.25 | 8.93e-01 |
| Observations | 654 | | |
| R² / R² adjusted | 0.786 / 0.783 | | |
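Whether the interaction terms improve the fit can also be tested jointly with a nested-model F test; a sketch, with `mtcars` standing in for the fev data:

```r
# Joint F test for interaction terms via nested models
fit.main <- lm(mpg ~ wt + hp + factor(am), data = mtcars)
fit.int  <- lm(mpg ~ (wt + hp) * factor(am), data = mtcars)
anova(fit.main, fit.int)  # small Pr(>F) => interactions help jointly
```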
| Assumption | Assessment | Solution |
|---|---|---|
| Linearity | `car::residualPlots`, want horizontal band around 0 for each predictor | Transform \(Y\) or \(X\); GAM to automate linear/non-linear; polynomials |
| Normality of residuals | Histogram/density plot; normal QQ plot | Large N; transform \(Y\) or \(X\); make sure linearity assumption met; bootstrap CIs: `confint(car::Boot(fit))` |
| Equal variance | `car::residualPlot` and `car::spreadLevelPlot`, want horizontal band around 0; `car::ncvTest` | Transform \(Y\) using exponent from `spreadLevelPlot`; robust standard errors or bootstrap CI |
| Independent residuals | Plot residuals vs. time or other suspected clustering variables | Robust sandwich standard errors for cluster effect; linear mixed models |
| Multicollinearity | Check correlation between predictors; `car::vif`, want VIF < 10 | Given 2 highly correlated predictors, keep only 1; `caret::findCorrelation` to remove predictors with correlation > cutoff; PCA, regularized regression |
| Influential obs. | Plot standardized residuals vs. fitted values, \|r\| > 3 suggests outlier; Cook's distance, `car::influenceIndexPlot` | Sensitivity analysis fitting models with/without influential obs.; robust regression to downweight influential obs.: `quantreg::rq` |
| Interactions | Manually add interaction terms; significant? | Manually add interaction terms; stratify model by potential interaction terms; ML models that automatically handle interactions |
- `car::residualPlots` to assess linearity of each predictor. Want to see a horizontal band around 0 with no patterns. For non-linearity, use polynomials or GAM.
- `car::vif` to assess multicollinearity.
- `car::influenceIndexPlot` to assess influential observations.
- `car::Boot` for robust bootstrap confidence intervals.
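If `car` isn't at hand, the case-resampling bootstrap that `car::Boot` automates can be sketched in base R (`cars` stands in for the fev data):

```r
# Case-resampling bootstrap CIs for lm coefficients (base-R sketch of car::Boot)
set.seed(3)
fit.boot <- lm(dist ~ speed, data = cars)
B <- 1000
boot.coef <- t(replicate(B, {
  idx <- sample(nrow(cars), replace = TRUE)       # resample rows with replacement
  coef(lm(dist ~ speed, data = cars[idx, ]))
}))
# percentile 95% CI, one row per coefficient
t(apply(boot.coef, 2, quantile, probs = c(0.025, 0.975)))
```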