Mendelian Randomization in R

Causal inference in observational studies

Jaron Arbet

2025-07-25

Background

Instrumental Variable Analysis (IVA)

  • Method for doing causal inference in observational study
  • Mendelian Randomization is a particular form of IVA where IVs are germline genetic variant(s)

3 Conditions of an IV:

  1. IV is associated with the exposure

  2. No unmeasured confounders of the IV-outcome relationship

  3. IV does not affect the outcome other than through the exposure and does not affect any other trait that has a downstream effect on the outcome of interest.

Finally, MR makes one more assumption:

  1. Changes in the genetic IV have equivalent effects to changes in the exposure through environmental or drug intervention: gene–environment equivalence

    Sanderson, Eleanor, et al. “Mendelian randomization.” Nature reviews Methods primers 2.1 (2022): 6.

Intuition

If genetic variant(s) \(G\) cause exposure (\(G \rightarrow X\)) and cause the outcome only through the exposure (\(G \rightarrow X \rightarrow Y\)), then the exposure has a causal effect on the the outcome!

  • The above holds even if there are unmeasured confounders (\(U\)) of \(X\)-\(Y\) relationship
  • The key to “causal” is the fact that the genetic variant(s) are randomly assigned at conception, so they are not confounded by other factors that may affect the exposure and outcome.. similar to RCT.

    Go, Tae-Hwa, and Dae Ryong Kang. “Basic concepts of a mendelian randomization approach.” Cardiovascular Prevention and Pharmacotherapy 2.1 (2020): 24-30

Relation to RCT

    Sanderson, Eleanor, et al. “Mendelian randomization.” Nature reviews Methods primers 2.1 (2022): 6.

Data used for MR

  1. Patient-level data
  • If you have patient-level genetic, exposure & outcome data, then you can use a simple 2-stage regression approach
  1. Summary-level data:
  • GWAS summary stats (regression coefficient and standard error) are publicly available for thousands of phenotypes
  • If you can find GWAS summary stats for a set of variants associated with your exposure and corresponding summary stats for association with outcome, then you can use a meta-analysis approach to estimate the causal effect

MR with patient-level data: 2-stage regression

2-stage regression

  • \(X\): exposure
  • \(Y\): outcome
  • \(Z\): genetic variant(s) (instrumental variable)
  • \(L\): optional covariates to improve precision
  1. Model to predict exposure

\[ X = \pi_0 + \pi_{Z}Z + \pi_{L}L + \epsilon_X \]

  Obtain predicted exposure \(\hat{X}\)

  1. Model to predict outcome

\[ Y = \beta_0 + \beta_X\hat{X} + \beta_{L}L + \epsilon_Y \]

  • \(\hat{\beta}_X\) is the estimated causal effect of the exposure on the outcome. If \(\beta_X\) p-value < 0.05, then evidence for a causal association between \(X\) and \(Y\).

  • Can’t use traditional standard errors/CIs/pvalues. Modified versions are used to account for the uncertainty in \(\hat{X}\).

  • The above approach is unbiased for linear regression but biased for non-linear regression (e.g. logistic reg or CoxPH). Bias adjustments have been developed for these models.

Example

library(survival);
library(ivtools);
data(VitD);
VitD$vitd.zscore <- as.numeric(scale(VitD$vitd));
head(VitD);
  age filaggrin  vitd     time death vitd.zscore
1  41         0  53.3 16.16220     0  -0.4207751
2  62         0  26.4 16.06470     0  -1.4157752
3  41         0  47.7 16.34796     0  -0.6279127
4  42         0 103.5 16.32993     0   1.4360651
5  52         0  79.0 16.74289     0   0.5298383
6  62         0  70.1 15.17891     0   0.2006375
  • age: age at baseline
  • filaggrin: binary indicator of whether the subject has mutations in the filaggrin gene
  • vitd.zscore: z-score of baseline vitamin D level
  • time: followup time (years)
  • death: whether the subject died during follow-up

2-stage regression models

Notation:

  • Exposure model (fitX.LZ): predict exposure X with instrumental variable(s) Z and covariates L
  • Outcome model (fitT.LX): predict time-to-event outcome T with exposure X and covariates L
  1. Exposure model:
fitX.LZ <- glm(
    formula = vitd.zscore ~ age + filaggrin,
    data = VitD,
    family = gaussian
    );
gtsummary::tbl_regression(fitX.LZ);
Characteristic Beta 95% CI p-value
age -0.01 -0.01, 0.00 0.006
filaggrin 0.21 0.06, 0.35 0.006
Abbreviation: CI = Confidence Interval
  • The genetic instrumental variable is significantly associated with the exposure (vitamin D)
  1. Outcome model:
fitT.LX <- coxph(
    formula = Surv(time, death) ~ age + vitd.zscore,
    data = VitD
    );
gtsummary::tbl_regression(fitT.LX , exponentiate = TRUE);
Characteristic HR 95% CI p-value
age 1.10 1.09, 1.11 <0.001
vitd.zscore 0.82 0.75, 0.90 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
  • The exposure (vitamin D) is significantly associated with the outcome (mortality), however, there may be unobserved confounders \(\rightarrow\) we will use IVA to try and deal with this.
  1. Combine the exposure and outcome models to estimate the causal effect:
  • Since we have a time-to-event outcome, use ivcoxph(). For continuous or binary outcomes, use ivglm().
fit.iva <- ivtools::ivcoxph(
    estmethod = 'ts',
    fitX.LZ = fitX.LZ,
    fitT.LX = fitT.LX,
    data = VitD,
    ctrl = TRUE # reduced bias
    );

# Make table of results
tab <- data.frame(summary(fit.iva)$coef, check.names = FALSE);
tab$HR <- round(exp(tab$Estimate), 2);
tab$pvalue <- ifelse(tab$`Pr(>|z|)` < 0.001, '<0.001', round(tab$`Pr(>|z|)`, 3));
ci <- round(exp(confint(fit.iva)), 2);
tab <- cbind(tab, ci);
tab$feature <- rownames(tab);

gt::gt(tab[, c('feature', 'HR', '2.5 %', '97.5 %', 'pvalue')]);
feature HR 2.5 % 97.5 % pvalue
age 1.10 1.08 1.11 <0.001
vitd.zscore 0.21 0.04 1.09 0.063
R 0.83 0.75 0.91 <0.001
  • A 1 standard deviation increase in vitamin D is causally associated with 0.21 times less risk of death (95% CI: 0.04 to 1.09, p = 0.063).

MR with summary data: meta-analysis approach

How do we choose genetic instrumental variables?

  • Identify GWAS SNPs from public databases that are associated with the exposure (\(p < 5.0 * 10^{–8}\))
  • Using public databases, we can obtain regression coefficients and standard errors for the set of genetic IVs for their association with the exposure (\(X\)) and outcome (\(Y\))
  • Then we can use “inverse variance weighting” (IVW) meta analysis methods to combine all the info and estimate the causal effect of \(X\) on \(Y\)

Inverse Variance Weighting (IVW) meta analysis method

Example

(Yavorska and Burgess 2017):

  • Exposure (\(X\)): high-density lipoprotein chol. (log HDL-c)
  • Outcome (\(Y\)): coronary heart disease (CHD)
  • Instrumental variables: 28 genetic SNPs known to be associated with HDL
library(MendelianRandomization);

# Example input data
mr.input <- mr_input(
    bx = hdlc, # slopes of SNPs on continuous exposure
    bxse = hdlcse, # standard errors
    by = chdlodds, # log odds ratios of SNPs on binary outcome
    byse = chdloddsse # standard errors of log odds ratios
    );
mr.input;
      SNP exposure.beta exposure.se outcome.beta outcome.se
1   snp_1        0.0020       0.004       0.0677     0.0286
2   snp_2        0.0050       0.004      -0.1625     0.0300
3   snp_3        0.0030       0.004      -0.1054     0.0310
4   snp_4        0.0010       0.003      -0.0619     0.0243
5   snp_5        0.0110       0.003      -0.0834     0.0222
6   snp_6        0.0310       0.006      -0.1278     0.0667
7   snp_7       -0.0030       0.004      -0.0408     0.0373
8   snp_8       -0.0070       0.006       0.0770     0.0543
9   snp_9       -0.0210       0.004       0.1570     0.0306
10 snp_10        0.0180       0.003      -0.0305     0.0236
11 snp_11       -0.0170       0.003       0.0100     0.0277
12 snp_12       -0.0470       0.005       0.1823     0.0403
13 snp_13        0.0220       0.004      -0.0408     0.0344
14 snp_14       -0.0290       0.004       0.1989     0.0335
15 snp_15        0.0160       0.003       0.0100     0.0378
16 snp_16        0.0340       0.004       0.0488     0.0292
17 snp_17        0.0350       0.003       0.0100     0.0253
18 snp_18        0.0190       0.004      -0.0408     0.0319
19 snp_19        0.0280       0.004      -0.0305     0.0316
20 snp_20        0.0001       0.003      -0.0408     0.0241
21 snp_21        0.0160       0.003      -0.0202     0.0285
22 snp_22        0.0050       0.003      -0.0619     0.0217
23 snp_23       -0.0100       0.004       0.0296     0.0298
24 snp_24       -0.0230       0.003       0.0677     0.0239
25 snp_25        0.0120       0.003      -0.0726     0.0220
26 snp_26        0.0180       0.003      -0.0726     0.0246
27 snp_27       -0.0060       0.003       0.0000     0.0255
28 snp_28        0.0040       0.006       0.0198     0.0647

IVW meta analysis

res <- mr_ivw(mr.input);
res;

Inverse-variance weighted method
(variants uncorrelated, random-effect model)

Number of Variants : 28 

------------------------------------------------------------------
 Method Estimate Std Error  95% CI        p-value
    IVW   -2.375     0.678 -3.704, -1.047   0.000
------------------------------------------------------------------
Residual standard error =  2.284 
Heterogeneity test statistic (Cochran's Q) = 140.9022 on 27 degrees of freedom, (p-value = 0.0000). I^2 = 80.8%. 
F statistic = 27.6. 
  • A 1 unit increase in log HDL-c is causally associated with 0.09 times lower odds of CHD (95% CI: 0.02 to 0.35, p = 0.00046).

  • The significant Heterogeneity test means there is heterogenity in the effect sizes between the SNPs. Ideally the SNPs should have consistent effects.. sensitivity analyses are used to address inconsistent effects.

Sensitivity analyses using Robust MR

Since it is unlikely that all genetic variants will be valid instrumental variables, several robust methods have been proposed. We compare nine robust methods for MR based on summary data…Our recommendation for investigators is to perform a variety of robust methods that operate in different ways and rely on different assumptions for valid inferences to assess the reliability of MR analyses. (Slob and Burgess 2020)

Robust MR in R

  • The MendelianRandomization R package implements many methods for robust MR

  • mr_allmethods(): variations of 3 robust methods

mr_allmethods(mr.input);
                    Method Estimate Std Error 95% CI         P-value
             Simple median   -2.952     0.834  -4.586 -1.318   0.000
           Weighted median   -1.822     0.593  -2.985 -0.659   0.002
 Penalized weighted median   -1.981     0.614  -3.184 -0.778   0.001
                                                                    
                       IVW   -2.375     0.678  -3.704 -1.047   0.000
             Penalized IVW   -2.963     0.504  -3.952 -1.975   0.000
                Robust IVW   -2.471     0.930  -4.295 -0.648   0.008
      Penalized robust IVW   -2.937     0.469  -3.856 -2.019   0.000
                                                                    
                  MR-Egger   -0.492     1.057  -2.563  1.580   0.642
               (intercept)   -0.043     0.019  -0.081 -0.005   0.026
        Penalized MR-Egger   -0.068     0.772  -1.582  1.446   0.930
               (intercept)   -0.045     0.012  -0.070 -0.021   0.000
           Robust MR-Egger   -0.344     1.692  -3.660  2.972   0.839
               (intercept)   -0.044     0.022  -0.088 -0.001   0.046
 Penalized robust MR-Egger    0.039     0.967  -1.858  1.935   0.968
               (intercept)   -0.047     0.015  -0.076 -0.018   0.001
  • Contrary to its name mr_allmethods does not implement all of the MR methods in the package.. Other robust methods include:
    • mr_cML: robust to all 3 IV assumptions
    • mr_divw
    • mr_pivw
    • mr_mvgmm
    • mr_conmix

Sensitivity analyses

Variant specific analyses

mr_forest(mr.input);

Leave one out sensitivity analysis

# currently only supports the IVW method
mr_loo(mr.input);

Comparing causal estimates from multiple methods

mr_forest(
    mr.input,
    snp_estimates = FALSE,
    methods = c('ivw', 'median', 'wmedian', 'egger', 'maxlik', 'mbe', 'conmix')
    );

Extensions in MendelianRandomization R package

Cancer applications

Other ideas for how MR could be used in our lab?

Summary

  • Instrumental variable analysis (IVA): causal inference with observational data
  • Mendelian randomization (MR): particular type of IVA where the IV are germline genetic variant(s)
  • MR has 4 main assumptions
    • Best practice recommends running multiple “robust MR methods” as sensitivity analyses (Burgess et al. 2023)
    • MendelianRandomization R package implements many of these robust methods
  • If you have patient-level data then you can use two-stage regression methods (ivtools package) OR the inverse-variance-weighting (IVW) meta analysis approach that only requires summary statistics (they are asymptotically equivalent for continuous outcomes)
  • If you only have summary-level data (regression coeffs and standard errors of SNPs vs. exposure and SNPs vs. outcome) then you can use the IVW meta analysis method or any other method in the MendelianRandomization R package

References

Broadbent, Jim R, Christopher N Foley, Andrew J Grant, Amy M Mason, James R Staley, and Stephen Burgess. 2020. “MendelianRandomization V0. 5.0: Updates to an r Package for Performing Mendelian Randomization Analyses Using Summarized Data.” Wellcome Open Research 5: 252.
Burgess, Stephen, George Davey Smith, Neil M Davies, Frank Dudbridge, Dipender Gill, M Maria Glymour, Fernando P Hartwig, et al. 2023. “Guidelines for Performing Mendelian Randomization Investigations: Update for Summer 2023.” Wellcome Open Research 4: 186.
Martinussen, Torben, Ditte Nørbo Sørensen, and Stijn Vansteelandt. 2019. “Instrumental Variables Estimation Under a Structural Cox Model.” Biostatistics 20 (1): 65–79.
Papadimitriou, Nikos, Niki Dimou, Konstantinos K Tsilidis, Barbara Banbury, Richard M Martin, Sarah J Lewis, Nabila Kazmi, et al. 2020. “Physical Activity and Risks of Breast and Colorectal Cancer: A Mendelian Randomisation Analysis.” Nature Communications 11 (1): 597.
Patel, Ashish, Ting Ye, Haoran Xue, Zhaotong Lin, Siqi Xu, Benjamin Woolf, Amy M Mason, and Stephen Burgess. 2023. “MendelianRandomization V0. 9.0: Updates to an r Package for Performing Mendelian Randomization Analyses Using Summarized Data.” Wellcome Open Research 8: 449.
Sanderson, Eleanor, M Maria Glymour, Michael V Holmes, Hyunseung Kang, Jean Morrison, Marcus R Munafò, Tom Palmer, et al. 2022. “Mendelian Randomization.” Nature Reviews Methods Primers 2 (1): 6.
Sjolander, Arvid, and Torben Martinussen. 2019. “Instrumental Variable Estimation with the r Package Ivtools.” Epidemiologic Methods 8 (1).
Slob, Eric AW, and Stephen Burgess. 2020. “A Comparison of Robust Mendelian Randomization Methods Using Summary Data.” Genetic Epidemiology 44 (4): 313–29.
Watts, Eleanor L, Tomas I Gonzales, Tessa Strain, Pedro F Saint-Maurice, D Timothy Bishop, Stephen J Chanock, Mattias Johansson, et al. 2024. “Observational and Genetic Associations Between Cardiorespiratory Fitness and Cancer: A UK Biobank and International Consortia Study.” British Journal of Cancer 130 (1): 114–24.
Yavorska, Olena O, and Stephen Burgess. 2017. “MendelianRandomization: An r Package for Performing Mendelian Randomization Analyses Using Summarized Data.” International Journal of Epidemiology 46 (6): 1734–39.
Ying, Kejun, Hanna Liu, Andrei E Tarkhov, Marie C Sadler, Ake T Lu, Mahdi Moqri, Steve Horvath, Zoltán Kutalik, Xia Shen, and Vadim N Gladyshev. 2024. “Causality-Enriched Epigenetic Age Uncouples Damage and Adaptation.” Nature Aging 4 (2): 231–46.