Run Random Forest Imputation with Optional Parallelization
run_rf_imputation.RdThis function performs missing value imputation using the missForest package
on a subset of columns in a data frame. Column selection can be explicit or based on a
regular expression, using resolve_target_cols internally.
Usage
run_rf_imputation(df, target_cols, control_RF = list())Arguments
- df
A data frame containing the data to be imputed. Only non-QC rows should be passed here.
- target_cols
A character vector of column names or a regex string to identify columns to impute.
- control_RF
A named list of additional or overriding arguments passed to
missForest::missForest(). Also supports an internal control argumentn_cores(numeric), used to determine parallelism whenparallelizeis not "no".Accepted entries include (see
missForest::missForestfor details):mtryNumber of variables randomly sampled at each split.
ntreeNumber of trees to grow.
maxiterMaximum number of imputation iterations.
parallelizeCharacter string: "no", "variables", or "forests". Enables parallelization.
variablewiseLogical. If TRUE, variable-wise error is returned.
verboseLogical. If TRUE, progress messages are printed.
replaceLogical. Whether sampling of observations is done with replacement.
classwt,cutoff,strata,sampsize, etc.Other optional arguments passed to
randomForest::randomForest.n_cores(Internal) Number of CPU cores to use if
parallelizeis not "no". This is removed before callingmissForest().
Value
A list with two components:
- imputed
A data frame of the same shape as the input subset, with imputed values.
- oob
A named numeric vector of out-of-bag error estimates for each feature (if available).
Details
The function supports optional parallel execution via missForest's parallelize
argument. If n_cores is less than or equal to 1, parallelization is disabled automatically
and the imputation runs sequentially.
See also
missForest, resolve_target_cols,
register_cluster, unregister_cluster
Examples
if (FALSE) { # \dontrun{
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
# Default imputation on selected columns
run_rf_imputation(df, target_cols = c("a", "b"))
# Use regex to select columns and customize RF parameters
run_rf_imputation(
df,
target_cols = "^a|b$",
control_RF = list(ntree = 200, mtry = 1, n_cores = 4)
)
} # }