suppressPackageStartupMessages(library(missForest));
suppressPackageStartupMessages(library(missMDA));
suppressPackageStartupMessages(library(impute));
Imputing missing data
Introduction to methods for imputing missing values
Intro
Missing data analysis: Making it work in the real world
- Multiple imputation is the gold standard method for handling missing data but is computationally intensive.
- The EM algorithm works well, but general-purpose software for it is lacking.
- If doing likelihood-based regression modeling, then as long as you adjust for variables that influence missingness, results will not be biased due to missing data (although patients with missing values are dropped from the model, thus decreasing power).
- Imputation is likely still better because of the improved power (i.e. all patients can be included).
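To make the point about dropped patients concrete, here is a minimal sketch using `airquality`, a dataset that ships with base R (it is not part of the original notes and stands in for patient data). It only illustrates that `lm` silently drops incomplete rows by default; it says nothing about whether the missingness mechanism has been adjusted for.

```r
# airquality ships with base R; Ozone and Solar.R contain missing values
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality);

# Compare the number of rows supplied with the number actually used in the fit
n.total <- nrow(airquality);
n.used <- nobs(fit);
n.dropped <- length(fit$na.action); # rows removed by the default na.action (na.omit)

cat('rows in data:   ', n.total, '\n');
cat('rows used by lm:', n.used, '\n');
cat('rows dropped:   ', n.dropped, '\n');
```

Every dropped row is a sample that contributes nothing to the fit, which is the loss of power that imputation tries to recover.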
Single imputation
There are various scenarios where one wants to impute missing values a single time:
- When the goal is not statistical inference, but prediction
- If you want to cluster your data before using BoutrosLab.plotting.general::create.heatmap
- If the goal is statistical inference, people will often still do single imputation even though multiple imputation is the gold standard. Single imputation is much easier and faster, but it is more biased because it treats the imputed values as if they had been observed.
R notes:
- Set the seed beforehand
- All of these methods will throw errors if some features have 0 variance (i.e. they only take on 1 value). Remove such features beforehand.
- For missForest/missMDA: make sure rownames are set to patient.id beforehand
- For impute::impute.knn: make sure rownames are feature names and colnames are patient.id
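The preparation steps in the notes above could look roughly like the following sketch. The data frame `toy` and its column names (`patient.id`, `gene.a`, `gene.b`, `gene.c`) are made up for illustration, and the zero-variance check shown here is just one reasonable way to do it in base R.

```r
# Toy data; all names and values are hypothetical
toy <- data.frame(
    patient.id = c('P1', 'P2', 'P3', 'P4'),
    gene.a = c(1.2, NA, 0.7, 2.1),
    gene.b = c(5, 5, NA, 5), # only one observed value, i.e. zero variance
    gene.c = c(0.3, 0.9, 1.4, NA)
    );

# Set the seed before running any imputation method
set.seed(123);

# Use patient IDs as rownames, then drop the ID column
rownames(toy) <- toy$patient.id;
toy$patient.id <- NULL;

# Keep only features that take more than one distinct non-missing value
n.distinct <- vapply(toy, function(x) length(unique(x[!is.na(x)])), integer(1));
toy.filtered <- toy[, n.distinct > 1, drop = FALSE];

# missForest/missMDA layout: patients in rows, features in columns
print(toy.filtered);

# impute::impute.knn layout: features in rows, patients in columns
toy.for.knn <- t(as.matrix(toy.filtered));
print(toy.for.knn);
```

Here `gene.b` is removed because it only ever takes the value 5, and the transposed matrix has the orientation that `impute::impute.knn` expects.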
Continuous
set.seed(123);
data('geno', package = 'missMDA');
# impute.knn (features in rows, samples in cols)
# this is my preferred method for high dimensional continuous data since it is relatively fast
knn.imp <- impute::impute.knn(t(geno))$data;
knn.imp <- data.frame(t(knn.imp), check.names = FALSE);
# PCA (samples in rows, features in cols)
ncomp <- missMDA::estim_ncpPCA(geno, ncp.min = 0, ncp.max = 6);
pca.imp <- missMDA::imputePCA(geno, ncp = ncomp$ncp, scale = TRUE)$completeObs;
# missForest
mf.imp <- missForest::missForest(geno)$ximp;
Categorical
data('vnf', package = 'missMDA');
# missForest
mf.imp <- missForest::missForest(vnf)$ximp;
# MCA
#ncomp <- missMDA::estim_ncpMCA(vnf); # slow method to estimate number of components
mca.imp <- missMDA::imputeMCA(vnf, ncp = 3)$completeObs;
Mixed continuous/categorical
data('snorena', package = 'missMDA');
# missForest
mf.imp <- missForest::missForest(snorena)$ximp;
# FAMD
#missMDA::estim_ncpFAMD(snorena); # slow method to estimate number of components
famd.imp <- missMDA::imputeFAMD(snorena, ncp = 3)$completeObs;
Multiple imputation
- As mentioned in the intro, multiple imputation is the gold standard for handling missing data because it accounts for uncertainty in the imputed values.
- Impute dataset multiple times to create multiple imputed datasets. Analyze each dataset separately then pool the results for statistical inference.
- Multiple imputation by chained equations: what is it and how does it work?
- mice R package
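The impute, analyze, pool workflow described above can be sketched with the mice package. This is a minimal illustration, not a recommended analysis: the `nhanes` example data ships with mice, and the choices of `m = 5`, the default imputation methods, and the model `bmi ~ age + chl` are assumptions made only for the example.

```r
suppressPackageStartupMessages(library(mice));

# nhanes is a small example dataset shipped with mice (age, bmi, hyp, chl)
data('nhanes', package = 'mice');

# 1. Impute: create m = 5 completed datasets using mice's default methods
imp <- mice::mice(nhanes, m = 5, seed = 123, printFlag = FALSE);

# 2. Analyze: fit the same model separately in each completed dataset
fits <- with(imp, lm(bmi ~ age + chl));

# 3. Pool: combine estimates and standard errors across datasets (Rubin's rules)
pooled <- mice::pool(fits);
print(summary(pooled));
```

The pooled standard errors include the between-imputation variability, which is how multiple imputation carries the uncertainty in the imputed values into the final inference.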
Session Info
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] impute_1.80.0 missMDA_1.19 missForest_1.5
loaded via a namespace (and not attached):
[1] shape_1.4.6.1 gtable_0.3.5 xfun_0.49
[4] ggplot2_3.5.1 htmlwidgets_1.6.4 ggrepel_0.9.6
[7] lattice_0.22-6 vctrs_0.6.5 tools_4.4.1
[10] generics_0.1.3 parallel_4.4.1 sandwich_3.1-1
[13] tibble_3.2.1 fansi_1.0.6 pan_1.9
[16] cluster_2.1.6 jomo_2.7-6 pkgconfig_2.0.3
[19] Matrix_1.7-0 rngtools_1.5.2 scatterplot3d_0.3-44
[22] lifecycle_1.0.4 compiler_4.4.1 munsell_0.5.1
[25] leaps_3.2 codetools_0.2-20 htmltools_0.5.8.1
[28] glmnet_4.1-8 yaml_2.3.10 mice_3.17.0
[31] nloptr_2.1.1 pillar_1.9.0 FactoMineR_2.11
[34] tidyr_1.3.1 MASS_7.3-60.2 flashClust_1.01-2
[37] DT_0.33 doRNG_1.8.6 iterators_1.0.14
[40] rpart_4.1.23 boot_1.3-30 mitml_0.4-5
[43] multcomp_1.4-26 foreach_1.5.2 nlme_3.1-164
[46] tidyselect_1.2.1 digest_0.6.37 mvtnorm_1.3-1
[49] dplyr_1.1.4 purrr_1.0.2 splines_4.4.1
[52] fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
[55] cli_3.6.3 magrittr_2.0.3 randomForest_4.7-1.2
[58] survival_3.6-4 utf8_1.2.4 broom_1.0.7
[61] TH.data_1.1-2 scales_1.3.0 backports_1.5.0
[64] estimability_1.5.1 rmarkdown_2.28 emmeans_1.10.6
[67] nnet_7.3-19 lme4_1.1-35.5 zoo_1.8-12
[70] coda_0.19-4.1 evaluate_1.0.0 knitr_1.48
[73] doParallel_1.0.17 rlang_1.1.4 itertools_0.1-3
[76] Rcpp_1.0.13 xtable_1.8-4 glue_1.8.0
[79] minqa_1.2.8 rstudioapi_0.16.0 jsonlite_1.8.9
[82] R6_2.5.1 multcompView_0.1-10