1 Intro

Missing data analysis: Making it work in the real world

Multiple imputation is the gold standard method for handling missing data but is computationally intensive.
The EM algorithm works well, but general-purpose software for it is lacking.
If you are doing likelihood-based regression modeling, then as long as you adjust for the variables that influence missingness, results will not be biased due to missing data (although patients with missing values are dropped from the model, which decreases power).
- Imputation is likely still better due to improved power (i.e., all patients can be included).
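
The point about adjusting for variables that influence missingness can be checked with a small simulation. This is a minimal sketch under assumed settings: the variable names, sample size, true coefficients (intercept 1, slope 2), and the logistic missingness model are all made up for illustration. The outcome is more likely to be missing when x is large, yet the complete-case regression that adjusts for x still recovers the coefficients, while the unadjusted mean of the observed outcome does not recover the true mean.

```r
set.seed(123)

# Simulated data (all settings are illustrative assumptions)
n <- 20000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)  # true intercept = 1, true slope = 2

# Missingness in y depends only on x: larger x means y is more likely missing
p.miss <- plogis(-0.5 + 1.5 * x)
y.obs <- ifelse(runif(n) < p.miss, NA, y)

# lm() drops rows with missing values by default, so this is a
# complete-case analysis that adjusts for x
fit.cc <- lm(y.obs ~ x)
coef(fit.cc)  # close to the true values of 1 and 2

# Fewer rows are used, which is the loss of power mentioned above
nobs(fit.cc)

# Without adjusting for x, the observed mean of y is pulled below the full-data mean
mean(y.obs, na.rm = TRUE)
mean(y)
```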

2 Single imputation

There are various scenarios where one wants to impute missing values a single time:

When the goal is not statistical inference, but prediction
When you want to cluster your data to organize it in a heatmap or other figure, since most clustering methods can’t handle missing values
When the goal is statistical inference, people often still use single imputation even though multiple imputation is the gold standard. Single imputation is much easier and faster, although it is more biased since it ignores the uncertainty in the imputed values.

R notes:

Set the seed beforehand
All of these methods will throw errors if some features have 0 variance (i.e. they only take on 1 value). Remove such features beforehand.
For missForest/missMDA: make sure rownames are set to patient.id beforehand
For impute::impute.knn: make sure rownames are feature names and colnames are patient.id

suppressPackageStartupMessages(library(missForest))
suppressPackageStartupMessages(library(missMDA))
suppressPackageStartupMessages(library(impute))
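
The preparation steps in the notes above might look like the following in base R. This is a sketch on a toy data frame: the column name patient.id follows the notes, but the gene.* feature names and all values are made up for illustration. A feature is treated as having zero variance when its observed (non-missing) values take only one distinct value.

```r
# Toy data (feature names and values are hypothetical)
dat <- data.frame(
    patient.id = paste0("P", 1:6),
    gene.a = c(1.2, NA, 3.1, 0.7, 2.2, 1.9),
    gene.b = rep(5, 6),                      # zero variance: a single value
    gene.c = c(NA, 0.4, 0.4, 0.4, NA, 0.4),  # zero variance among observed values
    gene.d = c(3, 1, NA, 2, 5, 4)
)

# For missForest/missMDA: patients in rows, with rownames set to patient.id
rownames(dat) <- dat$patient.id
dat$patient.id <- NULL

# Drop features whose observed values take only one distinct value
n.distinct <- vapply(dat, function(x) length(unique(x[!is.na(x)])), integer(1))
dat.filtered <- dat[, n.distinct > 1, drop = FALSE]

# For impute::impute.knn: transpose so features are rows and patients are columns
mat <- t(as.matrix(dat.filtered))
dimnames(mat)
```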

2.1 Continuous

set.seed(123)

data('geno', package = 'missMDA')

# impute.knn (features in rows, samples in cols)
# this is my preferred method for high dimensional continuous data since it is relatively fast
knn.imp <- impute::impute.knn(t(geno))$data
knn.imp <- data.frame(t(knn.imp), check.names = FALSE)

# PCA (samples in rows, features in cols)
ncomp <- missMDA::estim_ncpPCA(geno, ncp.min = 0, ncp.max = 6)
pca.imp <- missMDA::imputePCA(geno, ncp = ncomp$ncp, scale = TRUE)$completeObs

# missForest
mf.imp <- missForest::missForest(geno)$ximp

2.2 Categorical

data('vnf', package = 'missMDA')

# missForest
mf.imp <- missForest::missForest(vnf)$ximp

# MCA
#ncomp <- missMDA::estim_ncpMCA(vnf); # slow method to estimate number of components
# the imputed dataset is returned in the 'completeObs' element
mca.imp <- missMDA::imputeMCA(vnf, ncp = 3)$completeObs

2.3 Mixed continuous/categorical

data('snorena', package = 'missMDA')

# missForest
mf.imp <- missForest::missForest(snorena)$ximp

# FAMD
#missMDA::estim_ncpFAMD(snorena); # slow method to estimate number of components
famd.imp <- missMDA::imputeFAMD(snorena, ncp = 3)$completeObs

3 Multiple imputation

As mentioned in the intro, multiple imputation is the gold standard for handling missing data because it accounts for uncertainty in the imputed values.
Impute the dataset multiple times to create multiple imputed datasets. Analyze each imputed dataset separately, then pool the results for statistical inference.
Multiple imputation by chained equations: what is it and how does it work?
mice R package
- mice JSS article
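
The impute, analyze, and pool steps can be sketched with the mice package. This is a minimal illustration rather than a recommended analysis: it uses the small nhanes example dataset that ships with mice, and the number of imputations (m = 5), the default imputation methods, and the regression model are assumptions chosen only to keep the example short.

```r
library(mice)

# nhanes: small example dataset bundled with mice (age, bmi, hyp, chl)
data("nhanes", package = "mice")

# 1. Impute: create m = 5 completed datasets using chained equations
imp <- mice::mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# 2. Analyze: fit the same model separately to each completed dataset
fits <- with(imp, lm(bmi ~ age + chl))

# 3. Pool: combine the estimates and standard errors across datasets
pooled <- mice::pool(fits)
pooled.summary <- summary(pooled)
pooled.summary
```

The pooled standard errors combine the within-imputation variance with the between-imputation variance, which is how the uncertainty in the imputed values enters the final inference.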

4 Bootstrap imputation

Bootstrap imputation is related to Section 3 in that you impute multiple times, but it is typically simpler for the analyst to set up. The basic idea is:

1. Choose one of the “single imputation methods” from Section 2
2. Use simple nonparametric bootstrap resampling (sampling rows with replacement) to create multiple “bootstrap versions” of your dataset
3. Impute each bootstrap version of the dataset using the method chosen in step 1
4. Analyze each imputed bootstrap dataset separately, then pool the results for statistical inference. Standard bootstrap inference methods apply here (e.g., percentile confidence intervals, bootstrap p-values)

Although bootstrap imputation is generally simpler to implement than multiple imputation (especially if one uses the boot R package), its main downside is that it is typically much slower.
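
The bootstrap imputation steps can be sketched with the boot package. This is a minimal sketch under stated assumptions: impute_single() is a hypothetical placeholder that fills each column with its observed mean so the example runs without extra packages (in practice you would swap in one of the Section 2 methods), and the airquality data, the regression model, and R = 500 resamples are chosen only for illustration.

```r
library(boot)  # recommended package that ships with R

set.seed(123)

# Hypothetical stand-in for a Section 2 method: fills each column's missing
# values with that column's observed mean. Replace with impute.knn,
# missForest, etc. in a real analysis.
impute_single <- function(dat) {
    for (j in names(dat)) {
        miss <- is.na(dat[[j]])
        dat[[j]][miss] <- mean(dat[[j]], na.rm = TRUE)
    }
    dat
}

# Statistic for boot(): take the resampled rows (step 2), impute them (step 3),
# then fit the analysis model and return the estimate of interest (step 4)
boot_stat <- function(dat, idx) {
    imputed <- impute_single(dat[idx, , drop = FALSE])
    coef(lm(Ozone ~ Temp, data = imputed))[["Temp"]]
}

# airquality (base R) has missing values in Ozone and Solar.R
dat <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

boot.res <- boot::boot(dat, statistic = boot_stat, R = 500)

# Pool across bootstrap datasets with a percentile confidence interval
boot.ci.perc <- boot::boot.ci(boot.res, conf = 0.95, type = "perc")
boot.ci.perc$percent[4:5]  # lower and upper limits
```

Because the imputation runs once per bootstrap resample, the run time scales with R times the cost of a single imputation, which is why this approach becomes slow with methods such as missForest.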

Session Info

R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] impute_1.84.0    missMDA_1.21     missForest_1.6.1

loaded via a namespace (and not attached):
 [1] gtable_0.3.6         shape_1.4.6.1        xfun_0.57           
 [4] ggplot2_4.0.3        htmlwidgets_1.6.4    ggrepel_0.9.8       
 [7] lattice_0.22-7       vctrs_0.7.3          tools_4.5.1         
[10] Rdpack_2.6.6         generics_0.1.4       parallel_4.5.1      
[13] tibble_3.3.1         pan_1.9              cluster_2.1.8.1     
[16] jomo_2.7-6           pkgconfig_2.0.3      Matrix_1.7-3        
[19] RColorBrewer_1.1-3   S7_0.2.2             rngtools_1.5.2      
[22] scatterplot3d_0.3-45 lifecycle_1.0.5      compiler_4.5.1      
[25] farver_2.1.2         leaps_3.2            codetools_0.2-20    
[28] htmltools_0.5.9      yaml_2.3.12          glmnet_4.1-10       
[31] mice_3.19.0          nloptr_2.2.1         pillar_1.11.1       
[34] FactoMineR_2.14      tidyr_1.3.2          MASS_7.3-65         
[37] flashClust_1.1-4     DT_0.34.0            reformulas_0.4.4    
[40] doRNG_1.8.6.3        iterators_1.0.14     rpart_4.1.24        
[43] boot_1.3-31          foreach_1.5.2        mitml_0.4-5         
[46] nlme_3.1-168         tidyselect_1.2.1     digest_0.6.39       
[49] mvtnorm_1.3-7        dplyr_1.2.1          purrr_1.2.2         
[52] splines_4.5.1        fastmap_1.2.0        grid_4.5.1          
[55] cli_3.6.6            magrittr_2.0.5       randomForest_4.7-1.2
[58] survival_3.8-3       broom_1.0.12         scales_1.4.0        
[61] backports_1.5.1      estimability_1.5.1   rmarkdown_2.31      
[64] emmeans_2.0.3        nnet_7.3-20          lme4_2.0-1          
[67] otel_0.2.0           ranger_0.18.0        evaluate_1.0.5      
[70] knitr_1.51           rbibutils_2.4.1      doParallel_1.0.17   
[73] rlang_1.2.0          itertools_0.1-3      Rcpp_1.1.1-1.1      
[76] glue_1.8.1           BiocManager_1.30.27  renv_1.1.5          
[79] minqa_1.2.8          jsonlite_2.0.0       R6_2.6.1            
[82] multcompView_0.1-11

Citation

BibTeX citation:

@online{arbet2026,
  author = {Arbet, Jaron},
  title = {Imputing Missing Data},
  date = {2026-05-01},
  url = {https://jarbet.github.io/possibly-significant/posts/imputation/},
  langid = {en}
}

For attribution, please cite this work as:

Arbet, Jaron. 2026. “Imputing Missing Data.” May 1. https://jarbet.github.io/possibly-significant/posts/imputation/.