Causal inference with observational data

using propensity score matching

Jaron Arbet

9/2/24

Background

Why?

RCTs are expensive:

Median cost of a Phase 3 trial: $19 million (IQR: $12.2M - $33.1M) (Moore et al. 2018)
Average cost of bringing a new drug to market: ~$1 - 3 billion (Wouters, McKee, and Luyten 2020)

RCT results may not generalize to the larger “real world” population

RCTs may not be practical or ethically feasible

Real World

Real World Data (RWD):

Observational, collected in “real world” setting
EHR database, hospital visit notes, wearable devices
Medical claims/billing database, disease registries

Real World Evidence (RWE):

Clinical evidence about the benefits/harms of medical products derived from analyzing RWD (Franklin et al. 2021)
Poor quality RWD: garbage-in-garbage-out
90 examples of RWE used by FDA

Potential outcomes causal framework

2 exposure groups: e.g. Treatment vs Control
How does the exposure affect a given outcome $Y$?
The $i$th subject has 2 potential outcomes: \[Y_i(T) \text{ and } Y_i(C)\]
For each subject, the treatment effect is defined as: \[Y_i(T) - Y_i(C)\]
- Only 1 of the 2 potential outcomes is observed for each subject, so individual treatment effects cannot be computed directly

(Little and Rubin 2000; Rubin 2005)

Example

Subject	$Y_i(T)$	$Y_i(C)$	Trt. Effect: $Y_i(T) - Y_i(C)$
1	14	?	?
2	9	?	?
3	8	?	?
4	?	5	?
5	?	10	?
6	?	7	?
Mean	10.33	7.33	3
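The bottom row of the table can be reproduced from the observed outcomes alone. A minimal R sketch (the variable names are illustrative, not from the slides):

```r
# Observed outcomes from the table; NA marks the unobserved potential outcome
y_t <- c(14, 9, 8, NA, NA, NA)   # Y_i(T), seen only for treated subjects 1-3
y_c <- c(NA, NA, NA, 5, 10, 7)   # Y_i(C), seen only for control subjects 4-6

# Individual effects are all NA: no subject has both outcomes observed
individual_effect <- y_t - y_c

# Group means use whichever outcomes were observed
mean_t <- mean(y_t, na.rm = TRUE)
mean_c <- mean(y_c, na.rm = TRUE)
diff_means <- mean_t - mean_c
round(c(mean_t = mean_t, mean_c = mean_c, diff = diff_means), 2)
```

The group means (10.33 and 7.33) and their difference (3) are estimable even though every individual treatment effect is missing.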

In an RCT, the difference in observed group means $\hat{\tau} = \bar{Y}(T) - \bar{Y}(C)$ estimates a causal effect
In general, $\hat{\tau}$ is not causal for observational studies (OS)
- Many methods try to recover a causal estimate from OS (Austin 2011a; Varga et al. 2023)

Propensity scores (PS)

“Probability of treatment assignment based on observed baseline covariates” (Rosenbaum and Rubin 1983)

Given treatment variable $X_i \in \{0,1\}$ and baseline covariates $\boldsymbol{Z}_i$, the PS is:

\[ PS_i = \Pr(X_i = 1 \mid \boldsymbol{Z}_i) = f(\boldsymbol{Z}_i) \qquad(1)\]

$f$ in Equation 1 is estimated with logistic regression or machine learning
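As a rough illustration of the logistic regression route, the sketch below fits a PS model in base R. The data are simulated, and the covariate names (`age`, `educ`) and coefficients are assumptions made for the example only:

```r
set.seed(1)
n <- 500
# Hypothetical baseline covariates Z (simulated, not the slides' data)
age  <- rnorm(n, mean = 40, sd = 10)
educ <- rnorm(n, mean = 12, sd = 2)

# Simulated treatment assignment that depends on the covariates
true_ps <- plogis(-1 + 0.03 * (age - 40) + 0.2 * (educ - 12))
x <- rbinom(n, size = 1, prob = true_ps)
dat <- data.frame(x, age, educ)

# Logistic regression for Pr(X = 1 | Z)
ps_model <- glm(x ~ age + educ, data = dat, family = binomial())

# Estimated PS = fitted probability of treatment for each subject
dat$ps <- predict(ps_model, type = "response")
summary(dat$ps)
```

Every subject, treated or not, gets an estimated PS strictly between 0 and 1.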

Why model treatment assignment?

Confounder $\boldsymbol{Z}$ causes both trt $\boldsymbol{X}$ and outcome $\boldsymbol{Y}$

This makes $\boldsymbol{X}$ correlated with $\boldsymbol{Y}$ even if there is no treatment effect: correlation $\neq$ causation

PS models the relation between $\boldsymbol{X}$ and $\boldsymbol{Z}$; comparing patients with similar PS removes confounding by the measured $\boldsymbol{Z}$

https://sixsigmadsi.com/glossary/confounding/

Without a model for how treatments are assigned to units, formal causal inference is impossible (Little and Rubin 2000)

4 methods of using PS

Matching

The PS is a balancing score: patients with similar PS should have similar baseline covariates (Austin 2011a)

(McDonald et al. 2013)

PS is a composite summary of all baseline covariates
Match Trt-Control patients with similar PS values

Matching: many choices

Recommendations from simulations of (Austin 2014a):

Optimal vs greedy nearest neighbor matching?

Greedy: iteratively match to nearest neighbor
Optimal: minimize the total distance between all pairs
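The difference between the two rules can be seen on a toy example. The sketch below uses made-up PS values and a brute-force search for the optimal match. Dedicated software uses far more efficient algorithms, so this is only meant to show the idea:

```r
# Toy propensity scores (hypothetical values chosen so the two rules differ)
ps_trt <- c(0.50, 0.60)
ps_ctl <- c(0.55, 0.30)

# Greedy: take treated subjects in order, give each its nearest unused control
greedy_match <- function(ps_trt, ps_ctl) {
  available <- rep(TRUE, length(ps_ctl))
  matched <- integer(length(ps_trt))
  for (i in seq_along(ps_trt)) {
    d <- abs(ps_trt[i] - ps_ctl)
    d[!available] <- Inf
    matched[i] <- which.min(d)
    available[matched[i]] <- FALSE
  }
  matched
}

# Optimal (brute force, only feasible for tiny examples): try every assignment
# of distinct controls to treated subjects and keep the smallest total distance
optimal_match <- function(ps_trt, ps_ctl) {
  candidates <- expand.grid(rep(list(seq_along(ps_ctl)), length(ps_trt)))
  distinct <- apply(candidates, 1, function(r) !anyDuplicated(r))
  candidates <- as.matrix(candidates[distinct, , drop = FALSE])
  totals <- apply(candidates, 1, function(r) sum(abs(ps_trt - ps_ctl[r])))
  unname(candidates[which.min(totals), ])
}

total_distance <- function(m) sum(abs(ps_trt - ps_ctl[m]))
g <- greedy_match(ps_trt, ps_ctl)
o <- optimal_match(ps_trt, ps_ctl)
c(greedy = total_distance(g), optimal = total_distance(o))
```

Here greedy matching gives the first treated subject its closest control (0.55), which leaves the second with a poor match and a total distance of 0.35. The optimal assignment accepts a worse first pair and reaches a total of 0.25.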

Caliper or no caliper?

https://help.easymedstat.com/support/solutions/articles/77000538175-caliper-in-propensity-score-matching

How wide should the caliper be? 0.2*SD of the logit PS (Austin 2011b)
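A minimal sketch of how such a caliper could be computed and applied, assuming simulated PS values and a hypothetical helper `within_caliper()`:

```r
set.seed(2)
# Hypothetical estimated propensity scores (simulated for illustration)
ps <- rbeta(200, shape1 = 2, shape2 = 5)

# Work on the logit scale: log(ps / (1 - ps))
logit_ps <- qlogis(ps)

# Caliper width = 0.2 standard deviations of the logit PS
caliper <- 0.2 * sd(logit_ps)

# A treated-control pair is an admissible match only if the two logit PS
# values differ by no more than the caliper
within_caliper <- function(ps_trt, ps_ctl, caliper) {
  abs(qlogis(ps_trt) - qlogis(ps_ctl)) <= caliper
}
within_caliper(0.30, 0.31, caliper)  # close pair
within_caliper(0.30, 0.60, caliper)  # distant pair
```

Treated subjects with no control inside the caliper stay unmatched, which is the trade-off between match quality and sample size.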

Match with or without replacement?

For greedy matching, in what order should you select the treated subjects?
- Lowest to highest PS, highest to lowest, best match first, or random order

Types of treatment effects

ATE: $E\big[Y_i(T) - Y_i(C)\big]$

Average effect for ALL patients in the population

ATT: $E\big[Y_i(T) - Y_i(C) \,\big|\, T\big]$

Among patients who were Treated, how would their outcomes have changed had they been Controls?

ATC: $E\big[Y_i(T) - Y_i(C) \,\big|\, C\big]$

Among patients who were Controls, how would their outcomes have changed if they were Treated?
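The three estimands only differ when the treatment effect varies across patients and treatment assignment is related to that variation. The simulation below is a sketch under assumed values (one covariate `z`, effect `1 + z`), where both potential outcomes are known by construction:

```r
set.seed(3)
n <- 10000
z <- rnorm(n)                   # a single hypothetical baseline covariate
y_c <- 10 + 2 * z + rnorm(n)    # potential outcome under Control
y_t <- y_c + 1 + z              # potential outcome under Treatment (effect = 1 + z)

# Treatment is more likely when z is large, so treated subjects benefit more
x <- rbinom(n, size = 1, prob = plogis(z))

effect <- y_t - y_c
ate <- mean(effect)            # all subjects
att <- mean(effect[x == 1])    # subjects who were treated
atc <- mean(effect[x == 0])    # subjects who were controls
round(c(ATE = ate, ATT = att, ATC = atc), 2)
```

In this setup the ATT exceeds the ATE, which exceeds the ATC, and the ATE is the average of the ATT and ATC weighted by group size.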

Method	ATE	ATT/ATC
Matching	❌	✅
Stratification	✅	✅
Inverse probability weighting	✅	✅
Covariate adjustment	❌	❌

(Deb et al. 2016)

Example

Employment training and income

Treatment (N = 185): National Supported Work Demonstration (NSW) employment training program
Control (N = 429): from the Panel Study of Income Dynamics (PSID)
Goal: Does the training program increase mean income in 1978?
Baseline covariates: age, education, race, marital status, pre-treatment income (1974/1975)

Nearest neighbor caliper matching

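library(MatchIt);

set.seed(1234);
# 1:1 greedy nearest neighbor matching on a logistic regression PS.
# 'treat ~ .' uses every other column of psdata as a covariate, so psdata is
# assumed to hold only the treatment indicator and the baseline covariates.
match.nnc.logit <- matchit(
    treat ~ .,
    data = psdata,
    method = 'nearest',   # greedy nearest neighbor
    distance = 'glm',     # PS from logistic regression
    replace = FALSE,      # each control is matched at most once
    # caliper is in SD units of the distance measure; with distance = 'glm'
    # this is the PS itself unless link = 'linear.logit' is also set
    caliper = 0.2,
    std.caliper = TRUE,
    ratio = 1 # 1:1 matching
    );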

MatchIt R package (Stuart 2011)
Default distance is logistic regression, but ?distance gives many other options (e.g. LASSO, random forests, boosting, NNets)
Austin (2011b) recommends caliper = 0.2*SD of the logit PS

PS distribution before/after matching

Notice poor overlap before matching ➔ won’t match all Trt patients, unless $N_{control}$ is large

	Control	Treated
Total	429	185
Matched	113 (26.3%)	113 (61.1%)
Unmatched	316 (73.7%)	72 (38.9%)

⬆ unmatched Trt patients = ⬆ bias in $\widehat{\text{ATT}}$: the matched sample instead estimates the “Average treatment effect in the Overlap population (ATO)” (Varga et al. 2023)

Covariate balance

Standardized differences

Cutoffs:

$\le$ 0.1 (Austin 2011a)
$\le$ 0.25 (Harder 2010)
Do not use p-values to assess balance (Austin 2011a)
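For a continuous covariate, the standardized difference is the difference in group means divided by a pooled standard deviation, following the form described in Austin (2011a). The sketch below uses made-up ages and a hypothetical helper `std_diff()`:

```r
# Standardized difference for a continuous covariate:
# difference in means divided by the pooled standard deviation
std_diff <- function(x_trt, x_ctl) {
  pooled_sd <- sqrt((var(x_trt) + var(x_ctl)) / 2)
  (mean(x_trt) - mean(x_ctl)) / pooled_sd
}

# Hypothetical ages (made up for illustration)
age_trt         <- c(30, 35, 40, 45, 50)
age_ctl_all     <- c(20, 25, 30, 35, 40)   # all controls, before matching
age_ctl_matched <- c(31, 34, 41, 44, 51)   # matched controls

d_before <- std_diff(age_trt, age_ctl_all)
d_after  <- std_diff(age_trt, age_ctl_matched)
round(c(before = d_before, after = d_after), 3)

# Compare against the stricter 0.1 cutoff
abs(c(before = d_before, after = d_after)) <= 0.1
```

Because the measure is scaled by the standard deviation rather than the standard error, it does not shrink or grow with sample size the way a p-value does.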

Estimated treatment effect

trt.mean	ctrl.mean	diff	conf.low	conf.high	p.value
6510	4938	1572	-365	3508	0.111
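The slides do not state how this table was computed. One common choice for 1:1 matched data is a paired t-test on the within-pair differences, sketched below on simulated incomes (the values are invented and will not reproduce the numbers above):

```r
set.seed(5)
n_pairs <- 113   # number of matched pairs reported above
# Hypothetical 1978 incomes for matched pairs (simulated, not the NSW/PSID data)
income_ctl <- rgamma(n_pairs, shape = 1, scale = 5000)
income_trt <- 0.3 * income_ctl + rgamma(n_pairs, shape = 1, scale = 4500)

# Paired t-test: analyses the within-pair differences
fit <- t.test(income_trt, income_ctl, paired = TRUE)

data.frame(
  trt.mean  = mean(income_trt),
  ctrl.mean = mean(income_ctl),
  diff      = unname(fit$estimate),
  conf.low  = fit$conf.int[1],
  conf.high = fit$conf.int[2],
  p.value   = fit$p.value
)
```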

Interestingly, the PS estimate of the trt effect ($1572) is close to the RCT estimate ($1641) (LaLonde 1986), although its 95% CI includes 0

Optimal matching

Matches 100% of Trt patients (unlike greedy caliper match)
Attempts to minimize the total distance between all pairs
However, it resulted in much worse covariate balance in this example

Summary

Goal: estimate causal effect of exposure $X$ on outcome $Y$, using observational data
PS matching balances measured confounders between exposure groups (similar to RCT)
- Intuitive: show covariates are balanced after matching, like RCT
- Of course, unmeasured confounders may remain
To estimate the ATT (ATC), all Trt (Control) patients need to be matched. If a high % are unmatched, consider other methods.

Alternatives

Stratification on PS uses all patients, but does not work well with survival data (Austin 2014b, 2016)
Inverse probability weighting and regression adjustment are very flexible, but generally not accepted by FDA (unlike matching) (Lu 2019)
Causal Random Forests look promising: https://grf-labs.github.io/grf/
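To make the first two alternatives concrete, the sketch below applies inverse probability weighting and PS stratification to simulated data. The data-generating model, including the true treatment effect of 2, is an assumption of the example and has nothing to do with the NSW data:

```r
set.seed(6)
n <- 2000
z <- rnorm(n)                         # hypothetical measured confounder
x <- rbinom(n, 1, plogis(0.8 * z))    # treatment depends on z
y <- 5 + 2 * x + 3 * z + rnorm(n)     # true treatment effect = 2

ps <- predict(glm(x ~ z, family = binomial()), type = "response")

# Naive difference in means is confounded by z
naive <- mean(y[x == 1]) - mean(y[x == 0])

# Inverse probability weights for the ATE: 1/PS for treated, 1/(1 - PS) for controls
w <- ifelse(x == 1, 1 / ps, 1 / (1 - ps))
ipw <- weighted.mean(y[x == 1], w[x == 1]) - weighted.mean(y[x == 0], w[x == 0])

# Stratification: cut the PS into quintiles, then average the within-stratum differences
stratum <- cut(ps, quantile(ps, 0:5 / 5), include.lowest = TRUE)
by_stratum <- sapply(split(data.frame(x, y), stratum),
                     function(d) mean(d$y[d$x == 1]) - mean(d$y[d$x == 0]))
strat <- mean(by_stratum)

round(c(naive = naive, ipw = ipw, stratified = strat), 2)
```

Both approaches keep every patient in the analysis. With only five strata, some confounding within strata usually remains, so the stratified estimate tends to sit between the naive and the weighted one.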

References

Austin, P. 2011a. “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies.” Multivariate Behavioral Research 46 (3): 399–424.

———. 2011b. “Optimal Caliper Widths for Propensity-Score Matching When Estimating Differences in Means and Differences in Proportions in Observational Studies.” Pharmaceutical Statistics 10 (2): 150–61.

———. 2014a. “A Comparison of 12 Algorithms for Matching on the Propensity Score.” Statistics in Medicine 33 (6): 1057–69.

———. 2014b. “The Use of Propensity Score Methods with Survival or Time-to-Event Outcomes: Reporting Measures of Effect Similar to Those Used in Randomized Experiments.” Statistics in Medicine 33 (7): 1242–58.

———. 2016. “The Performance of Different Propensity Score Methods for Estimating Absolute Effects of Treatments on Survival Outcomes: A Simulation Study.” Statistical Methods in Medical Research 25 (5): 2214–37.

Chen, Jeffrey W, David R Maldonado, Brooke L Kowalski, Kara B Miecznikowski, Cynthia Kyin, Jeffrey A Gornbein, and Benjamin G Domb. 2022. “Best Practice Guidelines for Propensity Score Methods in Medical Research: Consideration on Theory, Implementation, and Reporting. A Review.” Arthroscopy: The Journal of Arthroscopic & Related Surgery 38 (2): 632–42.

Deb, Saswata, Peter C Austin, Jack V Tu, Dennis T Ko, C David Mazer, Alex Kiss, and Stephen E Fremes. 2016. “A Review of Propensity-Score Methods and Their Use in Cardiovascular Research.” Canadian Journal of Cardiology 32 (2): 259–65.

Dehejia, Rajeev H, and Sadek Wahba. 1999. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” Journal of the American Statistical Association 94 (448): 1053–62.

Franklin, Jessica M, Elisabetta Patorno, Rishi J Desai, Robert J Glynn, David Martin, Kenneth Quinto, Ajinkya Pawar, et al. 2021. “Emulating Randomized Clinical Trials with Nonrandomized Real-World Evidence Studies: First Results from the RCT DUPLICATE Initiative.” Circulation 143 (10): 1002–13.

Harder, V. 2010. “Propensity Score Techniques and the Assessment of Measured Covariate Balance to Test Causal Associations in Psychological Research.” Psychological Methods 15 (3): 234.

LaLonde, Robert J. 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” The American Economic Review, 604–20.

Little, Roderick J, and Donald B Rubin. 2000. “Causal Effects in Clinical and Epidemiological Studies via Potential Outcomes: Concepts and Analytical Approaches.” Annual Review of Public Health 21 (1): 121–45.

Lu, N. 2019. “Good Statistical Practice in Utilizing Real-World Data in a Comparative Study for Premarket Evaluation of Medical Devices.” Journal of Biopharmaceutical Statistics 29 (4): 580–91.

McDonald, Robert J, Jennifer S McDonald, David F Kallmes, and Rickey E Carter. 2013. “Behind the Numbers: Propensity Score Analysis—a Primer for the Diagnostic Radiologist.” Radiology 269 (3): 640–45.

Moore, Thomas J, Hanzhe Zhang, Gerard Anderson, and G Caleb Alexander. 2018. “Estimated Costs of Pivotal Trials for Novel Therapeutic Agents Approved by the US Food and Drug Administration, 2015-2016.” JAMA Internal Medicine 178 (11): 1451–57.

Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.

Rubin, Donald B. 2005. “Causal Inference Using Potential Outcomes: Design, Modeling, Decisions.” Journal of the American Statistical Association 100 (469): 322–31.

Stuart, Elizabeth. 2011. “MatchIt: Nonparametric Preprocessing for Parametric Causal Inference.” Journal of Statistical Software.

Varga, Anita Natalia, Alejandra Elizabeth Guevara Morel, Joran Lokkerbol, Johanna Maria van Dongen, Maurits Willem van Tulder, and Judith Ekkina Bosmans. 2023. “Dealing with Confounding in Observational Studies: A Scoping Review of Methods Evaluated in Simulation Studies with Single-Point Exposure.” Statistics in Medicine 42 (4): 487–516.

Wouters, Olivier J, Martin McKee, and Jeroen Luyten. 2020. “Estimated Research and Development Investment Needed to Bring a New Medicine to Market, 2009-2018.” JAMA 323 (9): 844–53.