Causal inference with observational data

using propensity score matching

Jaron Arbet

9/2/24

Background

Why?

RCTs are expensive:

  • Median cost of Phase 3 trials: $19M (IQR: $12.2M to $33.1M) (Moore et al. 2018)
  • Average cost of bringing a new drug to market: roughly $1 to $3 billion (Wouters et al. 2020)

RCT results may not generalize to the larger “real world” population

RCTs may not be practical or ethically feasible

Real World

Real World Data (RWD):

  • Observational, collected in “real world” setting
  • EHR database, hospital visit notes, wearable devices
  • Medical claims/billing database, disease registries

Real World Evidence (RWE):

  • Clinical evidence about the benefits/harms of medical products derived from analyzing RWD (Franklin et al. 2021)
  • Poor quality RWD: garbage-in-garbage-out
  • 90 examples of RWE used by FDA

Potential outcomes causal framework

  • 2 exposure groups: e.g. Treatment vs Control
  • How does the exposure affect a given outcome \(Y\)?
  • The \(i\)th subject has 2 potential outcomes: \[Y_i(T) \text{ and } Y_i(C)\]
  • For each subject, the treatment effect is defined as: \[Y_i(T) - Y_i(C)\]
    • Only 1 of these is observed in reality

(Little and Rubin 2000; Rubin 2005)

Example

| Subject | \(Y_i(T)\) | \(Y_i(C)\) | Trt. effect: \(Y_i(T) - Y_i(C)\) |
|---|---|---|---|
| 1 | 14 | ? | ? |
| 2 | 9 | ? | ? |
| 3 | 8 | ? | ? |
| 4 | ? | 5 | ? |
| 5 | ? | 10 | ? |
| 6 | ? | 7 | ? |
| Mean | 10.33 | 7.33 | 3 |

  • In an RCT, the difference in observed group means \(\tau = \bar{Y}(T) - \bar{Y}(C)\) estimates a causal effect

  • In general, \(\tau\) is not causal for observational studies (OS)
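The arithmetic behind the example table can be checked with a short R sketch. The object names (`po`, `individual.effect`, `tau`) are made up for illustration; the values are the ones shown in the table.

```r
# Observed outcomes from the example table; NA marks the unobserved potential outcome
po <- data.frame(
    subject = 1:6,
    y.trt   = c(14, 9, 8, NA, NA, NA),
    y.ctrl  = c(NA, NA, NA, 5, 10, 7)
    );

# Individual treatment effects cannot be computed: every row is missing one outcome
individual.effect <- po$y.trt - po$y.ctrl;

# Group means only need the observed values
mean.trt  <- mean(po$y.trt, na.rm = TRUE);
mean.ctrl <- mean(po$y.ctrl, na.rm = TRUE);
tau <- mean.trt - mean.ctrl;

round(c(mean.trt = mean.trt, mean.ctrl = mean.ctrl, tau = tau), 2);
```

Every entry of `individual.effect` is `NA`, while the difference in group means is still available. Whether that difference is causal depends on how the groups were formed.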

Propensity scores (PS)

“Probability of treatment assignment based on observed baseline covariates” (Rosenbaum and Rubin 1983)

  • Given treatment variable \(X_i \in \{0,1\}\) and baseline covariates \(\boldsymbol{Z}_i\), the PS is the conditional probability of treatment:

\[ PS_i = \Pr(X_i = 1 \mid \boldsymbol{Z}_i) = f(\boldsymbol{Z}_i) \qquad(1)\]


Estimate \(f\) in Equation 1 with logistic regression or a machine learning model
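As a minimal sketch of Equation 1, the PS can be estimated with a logistic regression in base R. The data here are simulated and the covariates (`age`, `female`) and coefficients are invented for illustration only.

```r
set.seed(1);
n <- 2000;

# Hypothetical baseline covariates (simulated, not from the slides)
sim <- data.frame(
    age    = rnorm(n, mean = 50, sd = 10),
    female = rbinom(n, size = 1, prob = 0.5)
    );

# True treatment probability depends on the covariates
true.ps <- plogis(-0.5 + 0.05 * (sim$age - 50) + 0.8 * sim$female);
sim$treat <- rbinom(n, size = 1, prob = true.ps);

# Logistic regression model for treatment assignment
ps.model <- glm(treat ~ age + female, data = sim, family = binomial());

# Estimated PS: fitted probability of treatment for each subject
sim$ps <- predict(ps.model, type = 'response');
summary(sim$ps);
```

Each subject gets one number between 0 and 1 that summarizes all covariates in the model.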

Why model treatment assignment?

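A confounder \(\boldsymbol{Z}\) causes both the trt \(\boldsymbol{X}\) and the outcome \(\boldsymbol{Y}\)

  • This makes \(\boldsymbol{X}\) correlated with \(\boldsymbol{Y}\) even when \(\boldsymbol{X}\) has no effect on \(\boldsymbol{Y}\): correlation \(\neq\) causation

The PS models the relation between \(\boldsymbol{X}\) and \(\boldsymbol{Z}\); comparing patients with similar PS removes confounding by the measured covariates \(\boldsymbol{Z}\)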

https://sixsigmadsi.com/glossary/confounding/

Without a model for how treatments are assigned to units, formal causal inference is impossible (Little and Rubin 2000)
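A small simulation may help make the confounding argument concrete. This is a sketch under invented assumptions: one binary confounder `z`, treatment probabilities of 0.8 and 0.2, and a true treatment effect of exactly 0. Because the PS takes only two values here, comparing within levels of `z` is the same as comparing patients with equal PS.

```r
set.seed(2);
n <- 10000;

z <- rbinom(n, size = 1, prob = 0.5);                       # confounder
x <- rbinom(n, size = 1, prob = ifelse(z == 1, 0.8, 0.2));  # trt is more likely when z = 1
y <- 5 * z + rnorm(n);                                      # outcome depends on z only

# Naive comparison ignores z and suggests a large effect
naive <- mean(y[x == 1]) - mean(y[x == 0]);

# Comparison among patients with the same PS (same z)
within.z <- sapply(0:1, function(k) {
    mean(y[x == 1 & z == k]) - mean(y[x == 0 & z == k]);
    });

round(c(naive = naive, z0 = within.z[1], z1 = within.z[2]), 2);
```

The naive difference is far from 0 even though the treatment does nothing, while the within-stratum differences are close to 0.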

4 methods of using PS

Matching

The PS is a balancing score: patients with similar PS should have similar baseline covariates (Austin 2011a)

  • PS is a composite summary of all baseline covariates
  • Match Trt-Control patients with similar PS values

Matching: many choices

Recommendations from the simulations of Austin (2014a):

  • Optimal vs greedy nearest neighbor matching?
    • Greedy: iteratively match each treated subject to its nearest available control
    • Optimal: minimize the total distance between all pairs
  • Caliper or no caliper?
    • How wide should the caliper be? 0.2*SD of logit PS (Austin 2011b)
    • https://help.easymedstat.com/support/solutions/articles/77000538175-caliper-in-propensity-score-matching
  • Match with or without replacement?
  • For greedy matching, in what order should you select the treated subjects?
    • Lowest to highest PS, highest to lowest, best match first, or random order
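The choices above can be sketched as a hand-written greedy matcher. This is an illustration, not the MatchIt implementation: `greedy.match` is a hypothetical helper, the processing order (highest to lowest PS) is one arbitrary choice, and the caliper uses the SD of the logit PS across all subjects as a simplification.

```r
# Greedy 1:1 nearest neighbor matching without replacement on the logit PS.
# ps.trt, ps.ctrl: estimated propensity scores (hypothetical inputs).
# Returns, for each treated subject, the index of its matched control (NA if unmatched).
greedy.match <- function(ps.trt, ps.ctrl, caliper.sd = 0.2) {
    logit.trt  <- qlogis(ps.trt);
    logit.ctrl <- qlogis(ps.ctrl);

    # Caliper width: 0.2 * SD of the logit PS
    caliper <- caliper.sd * sd(c(logit.trt, logit.ctrl));

    available    <- rep(TRUE, length(ps.ctrl));
    matched.ctrl <- rep(NA_integer_, length(ps.trt));

    # Process treated subjects from highest to lowest PS
    for (i in order(ps.trt, decreasing = TRUE)) {
        if (!any(available)) break;
        dist <- abs(logit.trt[i] - logit.ctrl);
        dist[!available] <- Inf;   # controls are used at most once
        j <- which.min(dist);
        if (dist[j] <= caliper) {
            matched.ctrl[i] <- j;
            available[j] <- FALSE;
            }
        }
    return(matched.ctrl);
    }

matches <- greedy.match(
    ps.trt  = c(0.80, 0.60, 0.30),
    ps.ctrl = c(0.55, 0.31, 0.10, 0.62)
    );
matches;
```

In this toy input the treated subject with PS 0.80 has no control inside the caliper and stays unmatched, which is the price of using a caliper.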

Types of treatment effects

ATE: \(E\big[Y_i(T) - Y_i(C)\big]\)

  • Average effect for ALL patients in the population

ATT: \(E\big[Y_i(T) - Y_i(C) \mid T\big]\)

  • Among patients who were Treated, how would their outcomes have changed if they were a Control?

ATC: \(E\big[Y_i(T) - Y_i(C) \mid C\big]\)

  • Among patients who were Controls, how would their outcomes have changed if they were Treated?

| Method | ATE | ATT/ATC |
|---|---|---|
| Matching | | |
| Stratification | | |
| Inverse probability weighting | | |
| Covariate adjustment | | |

(Deb et al. 2016)
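The three estimands differ whenever the treatment effect varies with covariates that also drive treatment. A simulated sketch, in which both potential outcomes are known by construction (never the case with real data) and all numbers are invented:

```r
set.seed(3);
n <- 10000;

z <- rnorm(n);                                          # baseline covariate
treated <- rbinom(n, size = 1, prob = plogis(z)) == 1;  # trt more likely for large z

y.ctrl <- rnorm(n);           # potential outcome under Control
y.trt  <- y.ctrl + 1 + z;     # potential outcome under Treatment: effect grows with z
effect <- y.trt - y.ctrl;

ate <- mean(effect);              # all patients
att <- mean(effect[treated]);     # patients who were Treated
atc <- mean(effect[!treated]);    # patients who were Controls

round(c(ATE = ate, ATT = att, ATC = atc), 2);

# The ATE is the group-size weighted average of ATT and ATC
weighted.avg <- mean(treated) * att + mean(!treated) * atc;
```

Because treated patients tend to have larger `z`, the ATT exceeds the ATE, which in turn exceeds the ATC.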

Example

Employment training and income

  • Treatment (N = 185): National Supported Work Demonstration (NSW) employment training program
  • Control (N = 429): from the Panel Study of Income Dynamics (PSID)
  • Goal: does the training program increase mean income in 1978?
  • Baseline covariates: age, education, race, marital status, pre-treatment income (1974 and 1975)

Nearest neighbor caliper matching

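library(MatchIt);

# psdata: data frame holding the treatment indicator `treat` and the baseline
# covariates only; `treat ~ .` uses every other column as a PS predictor, so
# the outcome must not be a column of psdata
set.seed(1234);
match.nnc.logit <- matchit(
    treat ~ .,
    data = psdata,
    method = 'nearest',    # greedy nearest neighbor matching
    distance = 'glm',      # PS estimated by logistic regression
    replace = FALSE,       # each control is matched at most once
    caliper = 0.2,
    std.caliper = TRUE,    # caliper is in SD units of the distance measure
    ratio = 1              # 1:1 matching
    );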
  • MatchIt R package (Stuart 2011)
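  • Default distance is logistic regression, but ?distance gives many other options (e.g. LASSO, random forests, boosting, NNets)
  • Austin (2011b) recommends caliper = 0.2*SD of the logit PS; with distance = 'glm' the caliper applies to the PS itself unless link = 'linear.logit' is also set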

PS distribution before/after matching

  • Notice the poor overlap before matching ➔ not all Trt patients can be matched unless \(N_{control}\) is large

| | Control | Treated |
|---|---|---|
| Total | 429 | 185 |
| Matched | 113 (26.3%) | 113 (61.1%) |
| Unmatched | 316 (73.7%) | 72 (38.9%) |

  • The more Trt patients left unmatched, the more biased \(\widehat{\text{ATT}}\) becomes; the matched sample instead estimates the “Average treatment effect in the Overlap population (ATO)” (Varga et al. 2023)

Covariate balance

Standardized differences
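For a continuous covariate, the standardized difference is the difference in group means divided by the pooled SD, \(d = (\bar{x}_T - \bar{x}_C) / \sqrt{(s_T^2 + s_C^2)/2}\) (Austin 2011a); absolute values below about 0.1 are often taken to indicate negligible imbalance. A minimal base R sketch, where `std.diff` is a hypothetical helper and the ages are made-up numbers:

```r
# Standardized difference for a continuous covariate
# x: covariate values, treat: 1 = Treated, 0 = Control
std.diff <- function(x, treat) {
    x.trt  <- x[treat == 1];
    x.ctrl <- x[treat == 0];
    pooled.sd <- sqrt((var(x.trt) + var(x.ctrl)) / 2);
    return((mean(x.trt) - mean(x.ctrl)) / pooled.sd);
    }

# Toy data: treated patients are 10 years older on average, SD = 10 in both groups
age   <- c(30, 40, 50, 20, 30, 40);
treat <- c(1, 1, 1, 0, 0, 0);
d.age <- std.diff(x = age, treat = treat);
d.age;
```

Unlike a p-value, this measure does not shrink just because matching reduced the sample size, which makes it suitable for comparing balance before and after matching.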

Cutoffs:

Estimated treatment effect

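| trt.mean | ctrl.mean | diff | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| 6510 | 4938 | 1572 | -365 | 3508 | 0.111 |

  • Interestingly, the PS estimate of the trt effect ($1572) is close to the RCT estimate ($1641) (LaLonde 1986), although its confidence interval (-$365 to $3508) includes 0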
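One way such an estimate could be computed from a matched sample is sketched below. Several parts are assumptions: the `lalonde` data shipped with MatchIt stands in for the slides' `psdata`, the covariate list in the formula is a guess, and a Welch t-test that ignores the matched pairs is used for simplicity. The numbers it produces may therefore differ from the table above.

```r
library(MatchIt);

# Assumption: the lalonde data shipped with MatchIt stands in for psdata
data('lalonde', package = 'MatchIt');

set.seed(1234);
m.out <- matchit(
    treat ~ age + educ + race + married + nodegree + re74 + re75,
    data = lalonde,
    method = 'nearest',
    distance = 'glm',
    replace = FALSE,
    caliper = 0.2,
    std.caliper = TRUE,
    ratio = 1
    );

# Keep only the matched subjects
matched <- match.data(m.out);

# Welch t-test of 1978 income; groups are ordered 0 (Control), 1 (Treated)
fit <- t.test(re78 ~ treat, data = matched);

trt.effect <- c(
    trt.mean  = unname(fit$estimate[2]),
    ctrl.mean = unname(fit$estimate[1]),
    diff      = unname(fit$estimate[2] - fit$estimate[1]),
    conf.low  = -fit$conf.int[2],   # t.test reports Control - Treated, so flip the CI
    conf.high = -fit$conf.int[1],
    p.value   = fit$p.value
    );
round(trt.effect, 3);
```

Analyses that account for the pairing (for example cluster-robust standard errors on the pair identifier `subclass`) are a common alternative to the simple test used here.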

Optimal matching

  • Matches 100% of Trt patients (unlike greedy caliper matching)
  • Minimizes the total distance between all pairs
  • However, in this example it resulted in much worse covariate balance
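The difference between the greedy and optimal criteria shows up even with two treated and two control subjects. The PS values below are invented, and the optimal solution is found by enumerating both possible assignments, which only works for a toy example of this size.

```r
ps.trt  <- c(0.50, 0.60);
ps.ctrl <- c(0.56, 0.30);

# Distance matrix: rows are treated subjects, columns are controls
dist <- abs(outer(ps.trt, ps.ctrl, '-'));

# Greedy: treated 1 takes its nearest control, treated 2 gets the one left over
first <- which.min(dist[1, ]);
greedy.total <- dist[1, first] + dist[2, -first];

# Optimal: the assignment with the smallest total distance
optimal.total <- min(
    dist[1, 1] + dist[2, 2],
    dist[1, 2] + dist[2, 1]
    );

c(greedy = greedy.total, optimal = optimal.total);
```

Both approaches match every treated subject here, but neither applies a caliper, so each solution still contains a pair whose PS values differ by 0.2 or more. Pairs like this are one plausible reason balance can suffer when all treated patients are forced into a match.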

Summary

  • Goal: estimate causal effect of exposure \(X\) on outcome \(Y\), using observational data
  • PS matching balances measured confounders between exposure groups (similar to RCT)
    • Intuitive: show covariates are balanced after matching, like RCT
    • Of course, unmeasured confounders may remain
  • To estimate the ATT (ATC), all Trt (Control) patients need to be matched. If a high % are unmatched, consider other methods.

Alternatives

  • Stratification on PS uses all patients, but does not work well with survival data (Austin 2014b, 2016)

  • Inverse probability weighting and regression adjustment are very flexible, but generally not accepted by FDA (unlike matching) (Lu 2019)

  • Causal Random Forests look promising: https://grf-labs.github.io/grf/
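As an illustration of one alternative, inverse probability weighting can be sketched in a few lines of base R. The data are simulated with an invented true effect of 2, and the normalized weighted-mean estimator shown is only one of several IPW variants.

```r
set.seed(4);
n <- 20000;

z <- rnorm(n);                                 # measured confounder
x <- rbinom(n, size = 1, prob = plogis(z));    # trt assignment depends on z
y <- 2 * x + 3 * z + rnorm(n);                 # true trt effect = 2

# Estimated PS from a logistic regression
ps <- predict(glm(x ~ z, family = binomial()), type = 'response');

# ATE weights: 1/PS for Treated, 1/(1 - PS) for Controls
w <- ifelse(x == 1, 1 / ps, 1 / (1 - ps));

naive <- mean(y[x == 1]) - mean(y[x == 0]);
ipw   <- weighted.mean(y[x == 1], w[x == 1]) - weighted.mean(y[x == 0], w[x == 0]);

round(c(naive = naive, ipw = ipw), 2);
```

Unlike matching, every patient contributes to the estimate, but patients with PS near 0 or 1 receive very large weights, so the weight distribution should be inspected.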

References

Austin, P. 2011a. “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies.” Multivariate Behavioral Research 46 (3): 399–424.
———. 2011b. “Optimal Caliper Widths for Propensity-Score Matching When Estimating Differences in Means and Differences in Proportions in Observational Studies.” Pharmaceutical Statistics 10 (2): 150–61.
———. 2014a. “A Comparison of 12 Algorithms for Matching on the Propensity Score.” Statistics in Medicine 33 (6): 1057–69.
———. 2014b. “The Use of Propensity Score Methods with Survival or Time-to-Event Outcomes: Reporting Measures of Effect Similar to Those Used in Randomized Experiments.” Statistics in Medicine 33 (7): 1242–58.
———. 2016. “The Performance of Different Propensity Score Methods for Estimating Absolute Effects of Treatments on Survival Outcomes: A Simulation Study.” Statistical Methods in Medical Research 25 (5): 2214–37.
Chen, Jeffrey W, David R Maldonado, Brooke L Kowalski, Kara B Miecznikowski, Cynthia Kyin, Jeffrey A Gornbein, and Benjamin G Domb. 2022. “Best Practice Guidelines for Propensity Score Methods in Medical Research: Consideration on Theory, Implementation, and Reporting. A Review.” Arthroscopy: The Journal of Arthroscopic & Related Surgery 38 (2): 632–42.
Deb, Saswata, Peter C Austin, Jack V Tu, Dennis T Ko, C David Mazer, Alex Kiss, and Stephen E Fremes. 2016. “A Review of Propensity-Score Methods and Their Use in Cardiovascular Research.” Canadian Journal of Cardiology 32 (2): 259–65.
Dehejia, Rajeev H, and Sadek Wahba. 1999. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” Journal of the American Statistical Association 94 (448): 1053–62.
Franklin, Jessica M, Elisabetta Patorno, Rishi J Desai, Robert J Glynn, David Martin, Kenneth Quinto, Ajinkya Pawar, et al. 2021. “Emulating Randomized Clinical Trials with Nonrandomized Real-World Evidence Studies: First Results from the RCT DUPLICATE Initiative.” Circulation 143 (10): 1002–13.
Harder, V. 2010. “Propensity Score Techniques and the Assessment of Measured Covariate Balance to Test Causal Associations in Psychological Research.” Psychological Methods 15 (3): 234.
LaLonde, Robert J. 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” The American Economic Review, 604–20.
Little, Roderick J, and Donald B Rubin. 2000. “Causal Effects in Clinical and Epidemiological Studies via Potential Outcomes: Concepts and Analytical Approaches.” Annual Review of Public Health 21 (1): 121–45.
Lu, N. 2019. “Good Statistical Practice in Utilizing Real-World Data in a Comparative Study for Premarket Evaluation of Medical Devices.” Journal of Biopharmaceutical Statistics 29 (4): 580–91.
McDonald, Robert J, Jennifer S McDonald, David F Kallmes, and Rickey E Carter. 2013. “Behind the Numbers: Propensity Score Analysis—a Primer for the Diagnostic Radiologist.” Radiology 269 (3): 640–45.
Moore, Thomas J, Hanzhe Zhang, Gerard Anderson, and G Caleb Alexander. 2018. “Estimated Costs of Pivotal Trials for Novel Therapeutic Agents Approved by the US Food and Drug Administration, 2015-2016.” JAMA Internal Medicine 178 (11): 1451–57.
Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.
Rubin, Donald B. 2005. “Causal Inference Using Potential Outcomes: Design, Modeling, Decisions.” Journal of the American Statistical Association 100 (469): 322–31.
Stuart, Elizabeth. 2011. “MatchIt: Nonparametric Preprocessing for Parametric Causal Inference.” Journal of Statistical Software.
Varga, Anita Natalia, Alejandra Elizabeth Guevara Morel, Joran Lokkerbol, Johanna Maria van Dongen, Maurits Willem van Tulder, and Judith Ekkina Bosmans. 2023. “Dealing with Confounding in Observational Studies: A Scoping Review of Methods Evaluated in Simulation Studies with Single-Point Exposure.” Statistics in Medicine 42 (4): 487–516.
Wouters, Olivier J, Martin McKee, and Jeroen Luyten. 2020. “Estimated Research and Development Investment Needed to Bring a New Medicine to Market, 2009-2018.” JAMA 323 (9): 844–53.