Remove outliers (optionally stratified) using PCA + LOF
Wrapper around outlier_pca_lof() with conveniences:
- Select a subset of columns (target_cols)
- Exclude QC rows from outlier detection
- Temporarily impute missing values (half of column minimum) for detection
- Apply detection independently within user-defined strata (strata)
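For orientation, the general idea of scoring samples with a local outlier factor (LOF) on principal-component scores can be sketched in base R. This is only an illustration of the technique under stated assumptions, not the implementation of outlier_pca_lof(): the helper lof_scores(), the choice of two components, k = 5, and the cutoff of 2 are all hypothetical.

```r
# Simplified LOF for the rows of a numeric matrix. Assumes k >= 2, no
# duplicated rows, and breaks distance ties arbitrarily (hypothetical helper).
lof_scores <- function(X, k = 5) {
  D <- as.matrix(dist(X))            # pairwise Euclidean distances
  diag(D) <- Inf                     # a point is not its own neighbour
  n <- nrow(D)
  nn <- t(apply(D, 1, function(d) order(d)[seq_len(k)]))  # n x k neighbour indices
  kdist <- D[cbind(seq_len(n), nn[, k])]                  # distance to the k-th neighbour
  # Local reachability density: inverse of the mean reachability distance
  lrd <- vapply(seq_len(n), function(i) {
    1 / mean(pmax(kdist[nn[i, ]], D[i, nn[i, ]]))
  }, numeric(1))
  # LOF: mean density of the neighbours relative to the point's own density
  vapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i], numeric(1))
}

# Deterministic toy data: 20 ordinary samples plus one extreme sample (s21)
X <- cbind(
  f1 = c(sin(1:20), 8),
  f2 = c(cos(1.7 * (1:20)), 8),
  f3 = c(sin(0.3 * (1:20)), 8)
)
rownames(X) <- paste0("s", seq_len(nrow(X)))

# PCA on centred, scaled features; keep the first two components
pcs <- prcomp(X, center = TRUE, scale. = TRUE)$x[, 1:2]

scores <- lof_scores(pcs, k = 5)
names(scores) <- rownames(X)
flagged <- names(scores)[scores > 2]  # cutoff of 2 is an arbitrary choice
```

Scores near 1 indicate a sample whose local density matches that of its neighbours; the extreme sample s21 receives the largest score here.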
Arguments
- df
A data.frame with features in columns and samples in rows.
- target_cols
Character vector of column names (or tidyselect helpers if supported by resolve_target_cols()). If NULL, uses resolve_target_cols(df, NULL) to infer targets.
- is_qc
Logical vector (length nrow(df)) marking QC rows to exclude from detection. Defaults to all FALSE.
- method
Character; currently supports "pca-lof-overall" (default behavior).
- impute_method
NULL or "half-min-value". When set, missing values in target_cols are imputed as half the minimum non-missing value, by default within each stratum; if any stratum has a target column entirely NA, imputation is performed globally on non-QC rows (see Missing-data policy).
- restore_missing_values
Logical; if TRUE, original NAs in target_cols are restored after filtering.
- return_ggplots
Logical; if TRUE, returns a named list of ggplots per stratum.
- strata
NULL (default), a single column name in df, or an external vector of length nrow(df). When provided, outlier detection is run independently within each stratum (QC rows excluded within the stratum).
Value
A list with:
- df_filtered
df with detected outlier rows removed (QC rows always retained).
- excluded_ids
Character vector of row names removed (union across strata).
- plot_samples_outlier
If return_ggplots = TRUE, a named list of ggplot objects per stratum; otherwise NULL.
Details
Stratification
Set strata to:
- NULL (default) to run detection once over all non-QC rows, or
- a single column name in df, or
- an external vector (length nrow(df)) to group samples.
Outlier detection is performed independently within each stratum on non-QC rows (QC rows are always excluded from detection but retained in the output). Strata with fewer than 5 non-QC samples are skipped (no outliers removed for that stratum).
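The per-stratum logic described above can be sketched as follows. This is a minimal illustration, not the package's code: detect_by_stratum() and toy_detector() are hypothetical names, and the toy detector uses a median/MAD rule on a single column instead of PCA + LOF so that the example stays short.

```r
# Hypothetical helper: run `detect_fun` within each stratum of non-QC rows.
# `detect_fun` takes a data.frame and returns the row names it flags.
detect_by_stratum <- function(df, strata, detect_fun,
                              is_qc = rep(FALSE, nrow(df)), min_n = 5) {
  stopifnot(length(strata) == nrow(df), length(is_qc) == nrow(df))
  keep <- !is_qc                                     # QC rows never enter detection
  groups <- split(rownames(df)[keep], strata[keep])  # row names per stratum
  flagged <- character(0)
  for (ids in groups) {
    if (length(ids) < min_n) next                    # small stratum: skip, remove nothing
    flagged <- union(flagged, detect_fun(df[ids, , drop = FALSE]))
  }
  flagged
}

# Stand-in detector (assumption): flag rows more than 3 MADs from the median of f1
toy_detector <- function(d) {
  z <- abs(d$f1 - median(d$f1)) / mad(d$f1)
  rownames(d)[z > 3]
}

toy <- data.frame(
  f1 = c(1, 1.1, 0.9, 1, 1.05, 50, 2, 2.1, 100),
  batch = c(rep("A", 6), rep("B", 3))
)
rownames(toy) <- paste0("s", seq_len(nrow(toy)))

excluded <- detect_by_stratum(toy, toy$batch, toy_detector)
df_filtered <- toy[!rownames(toy) %in% excluded, , drop = FALSE]
```

In this toy data, s6 is flagged within stratum A, while the extreme value in s9 is kept because stratum B has only three samples and is skipped.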
Missing-data policy
- If impute_method = NULL and any target_cols contain missing values among non-QC rows, the function errors and lists the affected columns with counts. Enable impute_method = "half-min-value" or resolve missingness beforehand.
- If impute_method = "half-min-value":
  - The function first checks for target columns that are entirely NA across all non-QC rows. If any exist, it errors (a half-minimum cannot be computed).
  - It then checks, per stratum, for target columns that are entirely NA within that stratum. If any are found, a warning is emitted listing the affected strata and columns, and temporary imputation is applied on the whole non-QC dataset (ignoring stratification), while outlier detection still runs per stratum as requested.
- After outlier removal, if restore_missing_values = TRUE, the original NAs in target_cols are restored in the returned data.
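The policy above can be illustrated with a small base-R sketch. It is written under assumptions and is not the function's actual code: half_min() and impute_half_min() are hypothetical helpers, the warning text is invented, and QC handling is left out for brevity.

```r
# Replace NAs in a numeric vector with half of its smallest observed value.
half_min <- function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}

# Hypothetical helper: impute `cols` per stratum, or globally when some
# stratum has a target column that is entirely NA.
impute_half_min <- function(df, cols, strata = NULL) {
  all_na <- function(d) vapply(d[cols], function(x) all(is.na(x)), logical(1))
  if (any(all_na(df))) stop("target column(s) entirely NA; no half-minimum exists")
  stratum_ok <- !is.null(strata) &&
    !any(vapply(split(df, strata), function(d) any(all_na(d)), logical(1)))
  if (stratum_ok) {
    for (col in cols) df[[col]] <- ave(df[[col]], strata, FUN = half_min)
  } else {
    if (!is.null(strata)) warning("all-NA column within a stratum; imputing globally")
    df[cols] <- lapply(df[cols], half_min)
  }
  df
}

# Per-stratum imputation: NAs become 2 in stratum A and 5 in stratum B
toy <- data.frame(f1 = c(4, NA, 6, 10, NA, 20), batch = rep(c("A", "B"), each = 3))
per_stratum <- impute_half_min(toy, "f1", toy$batch)

# Stratum B is entirely NA, so imputation falls back to the global half-minimum (0.5)
toy2 <- data.frame(f1 = c(1, 2, 3, NA, NA, NA), batch = rep(c("A", "B"), each = 3))
global <- suppressWarnings(impute_half_min(toy2, "f1", toy2$batch))

# Restoring the original NAs only needs the mask taken before imputation
na_mask <- is.na(toy2$f1)
restored <- global
restored$f1[na_mask] <- NA
```

Because the imputed values are only needed for detection, keeping the NA mask from before imputation is enough to put the missing values back afterwards.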
Examples
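# Examples 1-3 assume `df` is an existing data.frame with columns f1, f2 and batch;
# it is not defined here, which is why they end in the error shown.
# 1) No stratification
remove_outliers(
  df,
  target_cols = c("f1", "f2"),
  impute_method = "half-min-value"
)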
#> Error in rep(FALSE, nrow(df)): invalid 'times' argument
# 2) Stratify by a column in df
remove_outliers(
  df,
  target_cols = c("f1", "f2"),
  strata = "batch",
  impute_method = "half-min-value"
)
#> Error in rep(FALSE, nrow(df)): invalid 'times' argument
# 3) Stratify by an external vector
# The external vector must have one entry per row of df
my_strata <- c("A", "A", "B", "B", "B", "C", "C")
remove_outliers(
  df,
  target_cols = c("f1", "f2"),
  strata = my_strata,
  impute_method = "half-min-value"
)
#> Error in rep(FALSE, nrow(df)): invalid 'times' argument
# 4) Stratum with all-NA target columns -> triggers global temporary imputation (warning)
# \donttest{
df2 <- data.frame(
  f1 = c(1, 2, 3, NA, NA, NA),
  f2 = c(2, 3, 4, NA, NA, NA),
  batch = c("A", "A", "A", "B", "B", "B")
)
rownames(df2) <- paste0("s", seq_len(nrow(df2)))
remove_outliers(
  df2,
  target_cols = c("f1", "f2"),
  strata = "batch",
  impute_method = "half-min-value"
)
#> Warning:
#>
#> Detected strata with columns entirely missing among non-QC rows. Applying temporary imputation on the whole non-QC dataset (ignoring stratification). Affected strata and columns:
#> strat : columns
#> - B: f1, f2
#> $df_filtered
#>    f1 f2 batch
#> s1  1  2     A
#> s2  2  3     A
#> s3  3  4     A
#> s4 NA NA     B
#> s5 NA NA     B
#> s6 NA NA     B
#>
#> $plot_samples_outlier
#> NULL
#>
#> $excluded_ids
#> character(0)
#>
# }