Filter a data.frame by missingness in rows and columns
filter_by_missingness.RdApplies different strategies to filter rows and columns of a dataset based on missingness thresholds. Supports iterative refinement until stable.
Usage
filter_by_missingness(
df,
row_thresh = 0.5,
col_thresh = 0.5,
target_cols = NULL,
is_qc = NULL,
filter_order = c("iterative", "simultaneous", "col_then_row", "row_then_col"),
max_iter = 10
)Arguments
- df
A data.frame.
- row_thresh
Proportion of missing values allowed per (non-QC) row in target columns.
- col_thresh
Proportion of missing values allowed per column in target columns.
- target_cols
Character vector of target columns. If NULL, resolved by
resolve_target_cols().- is_qc
Logical vector the same length as
nrow(df)indicating QC rows (always retained).- filter_order
One of
"iterative"(default),"col_then_row","row_then_col", or"simultaneous"."iterative": alternately filter rows and columns until stable ormax_iteris reached."col_then_row": filter columns first, then rows."row_then_col": filter rows first, then columns."simultaneous": filter rows and columns independently, then intersect results.
- max_iter
Maximum number of iterations when
filter_order="iterative". Default 10.
Examples
# Example dataset with deliberate missingness
df <- data.frame(
a = c(NA, 1, NA, 1, NA),
b = c(NA, NA, 2, 2, NA),
c = c(3, NA, NA, 3, NA),
d = 1
)
# Mark row 2 and 5 as QC (always kept, but excluded from filtering thresholds)
is_qc <- c(FALSE, TRUE, FALSE, FALSE, TRUE)
target_cols <- c("a","b","c")
# Original
print(df)
#> a b c d
#> 1 NA NA 3 1
#> 2 1 NA NA 1
#> 3 NA 2 NA 1
#> 4 1 2 3 1
#> 5 NA NA NA 1
# a b c d
# 1 NA NA 3 1
# 2 1 NA NA 1
# 3 NA 2 NA 1
# 4 1 2 3 1
# 5 NA NA NA 1
# Iterative filtering
print(filter_by_missingness(df, row_thresh=0.5, col_thresh=0.5,
target_cols=target_cols, is_qc=is_qc,
filter_order="iterative"))
#> b c d
#> 1 NA 3 1
#> 2 NA NA 1
#> 3 2 NA 1
#> 4 2 3 1
#> 5 NA NA 1
# [1] "iteration"
# b c d
# 1 NA 3 1
# 2 NA NA 1
# 3 2 NA 1
# 4 2 3 1
# 5 NA NA 1
# Simultaneous filtering
print(filter_by_missingness(df, row_thresh=0.5, col_thresh=0.5,
target_cols=target_cols, is_qc=is_qc,
filter_order="simultaneous"))
#> b c d
#> 2 NA NA 1
#> 4 2 3 1
#> 5 NA NA 1
# [1] "simultaneous"
# b c d
# 2 NA NA 1
# 4 2 3 1
# 5 NA NA 1
# Column-then-row filtering
print(filter_by_missingness(df, row_thresh=0.5, col_thresh=0.5,
target_cols=target_cols, is_qc=is_qc,
filter_order="col_then_row"))
#> b c d
#> 1 NA 3 1
#> 2 NA NA 1
#> 3 2 NA 1
#> 4 2 3 1
#> 5 NA NA 1
# b c d
# 1 NA 3 1
# 2 NA NA 1
# 3 2 NA 1
# 4 2 3 1
# 5 NA NA 1
# Row-then-column filtering
print(filter_by_missingness(df, row_thresh=0.5, col_thresh=0.5,
target_cols=target_cols, is_qc=is_qc,
filter_order="row_then_col"))
#> a b c d
#> 2 1 NA NA 1
#> 4 1 2 3 1
#> 5 NA NA NA 1
# a b c d
# 2 1 NA NA 1
# 4 1 2 3 1
# 5 NA NA NA 1
# For this simple data set the "iteration" and the "col_then_row" filtering results are the same.