Step 4: Hybrid Imputation (Random Forest + LCMD)

imputed_results <- OmicsProcessing::hybrid_imputation(
  log_transformed_df,
  target_cols = "@",
  method = c("RF-LCMD"),
  oobe_threshold = 0.1
)
imputed_df <- imputed_results$hybrid_rf_lcmd

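hybrid_imputation() combines two complementary strategies: random forest (RF) imputation, run through missForest, and LCMD imputation.

For each feature, the function fits an RF model and compares its out-of-bag error (OOBE) with oobe_threshold to decide whether to keep the RF estimate or switch that feature to LCMD. It returns a list containing: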

  • hybrid_rf_lcmd: the combined result.
  • rf / lcmd: per-method outputs.
  • oob: OOBE values (helpful for diagnostics).
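
The per-feature decision described above can be sketched in base R. This is only an illustration and not the package's internal code: the toy data frames, the feature names f1 to f3, the OOBE values, and the direction of the comparison (RF is kept when the OOBE is at or below the threshold) are all assumptions.

```r
# Toy stand-ins for the per-method outputs; all values are made up.
rf_out   <- data.frame(f1 = c(1.0, 2.0), f2 = c(3.0, 4.0), f3 = c(5.0, 6.0))
lcmd_out <- data.frame(f1 = c(0.9, 2.1), f2 = c(2.5, 3.5), f3 = c(4.0, 5.5))
oob      <- c(f1 = 0.04, f2 = 0.25, f3 = 0.10)  # hypothetical per-feature OOBE
oobe_threshold <- 0.1

# Assumed rule: keep RF when OOBE <= threshold, otherwise switch to LCMD.
use_lcmd <- names(oob)[oob > oobe_threshold]

# Start from the RF result and overwrite only the switched features.
hybrid <- rf_out
hybrid[use_lcmd] <- lcmd_out[use_lcmd]
```

In this toy case only f2 exceeds the threshold, so hybrid takes f2 from lcmd_out and keeps f1 and f3 from rf_out.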

See the full reference: hybrid_imputation().

Customising the RF and LCMD controls

You can tweak both engines via control lists:

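# target_cols must already exist as a character vector of feature column names;
# if it only held the "@" shorthand, length(target_cols) would be 1.
my_control_RF <- list(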
  parallelize = "no",
  mtry = floor(sqrt(length(target_cols))),
  ntree = 100,
  maxiter = 10,
  variablewise = TRUE,
  verbose = TRUE,
  n_cores = parallel::detectCores()
)

my_control_LCMD <- list(
  method.MAR = "KNN",
  method.MNAR = "QRILC"
)

df_rf_lcmd_hybrid <- OmicsProcessing::hybrid_imputation(
  log_transformed_df,
  target_cols = "@",
  method = c("RF-LCMD"),
  oobe_threshold = 0.1,
  control_LCMD = my_control_LCMD,
  control_RF = my_control_RF
)
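
The mtry expression in the control list relies on target_cols being a vector of column names. A minimal sketch of one way to build it is shown here; toy_df and its column names are hypothetical, and it assumes hybrid_imputation() accepts a character vector of column names for target_cols.

```r
# Hypothetical toy data: one identifier column plus four feature columns.
toy_df <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  met_a = c(1.2, NA, 0.8),
  met_b = c(NA, 2.1, 2.4),
  met_c = c(0.5, 0.7, NA),
  met_d = c(3.3, 3.1, 2.9)
)

# Name the feature columns explicitly so that length() counts features.
target_cols <- setdiff(names(toy_df), "sample_id")

# Common random forest heuristic: try about sqrt(p) predictors per split,
# with a floor of 1 so that very narrow data still gets a valid value.
mtry <- max(1, floor(sqrt(length(target_cols))))
```

With four feature columns this gives mtry = 2; the same target_cols vector could then be passed to the target_cols argument in place of "@".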

Parallelising the RF step (missForest)

missForest can run in parallel when you register a foreach backend and set parallelize:

library(doParallel)

n_cores <- parallel::detectCores(logical = FALSE)
cl <- parallel::makeCluster(n_cores)
doParallel::registerDoParallel(cl)

# target_cols must be a character vector of feature column names defined beforehand.
ctrl_parallel_RF <- list(
  parallelize = "variables", # or "forests"
  mtry = floor(sqrt(length(target_cols))),
  ntree = 200,
  maxiter = 10,
  variablewise = TRUE,
  verbose = TRUE
)

imputed_parallel <- OmicsProcessing::hybrid_imputation(
  log_transformed_df,
  target_cols = "@",
  method = "RF-LCMD",
  oobe_threshold = 0.1,
  control_RF = ctrl_parallel_RF
)

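# Shut down the workers and restore the sequential foreach backend.
parallel::stopCluster(cl)
foreach::registerDoSEQ() # registerDoSEQ() is exported by foreach, not doParallel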

Guidance:

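  • Use parallelize = "variables" when there are many features, so that the forests for different features are fitted in parallel; "forests" instead spreads the trees of a single forest across workers.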
  • Keep ntree reasonable when parallelising to avoid memory pressure.
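
When sizing the cluster, a defensive way to choose the worker count might look like the sketch below. The policy of leaving one core free and capping the workers at the number of features is an assumption rather than a package recommendation, and n_features is a hypothetical value.

```r
# detectCores() can return NA on some platforms, so fall back to one worker.
physical <- parallel::detectCores(logical = FALSE)
if (is.na(physical)) physical <- 1L

n_features <- 500L  # hypothetical number of feature columns

# Leave one core free and never start more workers than there are features.
n_workers <- max(1L, min(physical - 1L, n_features))
```

The resulting n_workers could be passed to parallel::makeCluster() in place of the raw core count.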

Tips:

  • Keep target_cols explicit when possible for clarity; "@" will use all feature columns resolved via resolve_target_cols().
  • Inspect imputed_results$oob to confirm that the RF/LCMD split aligns with your expectations.
  • For very wide matrices, tune ntree, mtry, or the number of worker cores to balance runtime and stability.
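
As a rough illustration of the OOBE check suggested in the tips, the sketch below tabulates how many features would fall on each side of the threshold. It assumes the OOBE values are available as a named numeric vector and that features above the threshold go to LCMD; the actual structure of imputed_results$oob may differ, and the values here are made up.

```r
# Hypothetical OOBE values; in practice these would come from imputed_results$oob.
oob <- c(f1 = 0.04, f2 = 0.25, f3 = 0.10, f4 = 0.31)
oobe_threshold <- 0.1

# Label each feature with the method it would receive under the assumed rule.
assigned <- ifelse(oob > oobe_threshold, "LCMD", "RF")
counts <- table(assigned)

# List features from worst to best OOBE to see which ones drive the split.
worst_first <- names(sort(oob, decreasing = TRUE))
```

A split that sends nearly every feature to one method may be a sign that oobe_threshold deserves a second look.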