Process feature data
process_data.RdThis function processes feature data given specified metadata. It supports exclusion of features with extreme missingness, various imputation methods, transformation, plate correction, centering, and case-control data handling.
Usage
process_data(
data,
data_meta_features = NULL,
data_meta_samples = NULL,
col_samples,
col_features = NULL,
save = FALSE,
path_out = NULL,
path_outliers = NULL,
exclusion_extreme_feature = FALSE,
missing_pct_feature = NULL,
exclusion_extreme_sample = FALSE,
missing_pct_sample = NULL,
imputation = FALSE,
imputation_method = NULL,
col_LOD = NULL,
transformation = FALSE,
transformation_method = NULL,
outlier = FALSE,
plate_correction = FALSE,
cols_listRandom = NULL,
cols_listFixedToKeep = NULL,
cols_listFixedToRemove = NULL,
col_HeteroSked = NULL,
centre_scale = FALSE,
case_control = FALSE,
col_case_control = NULL
)Arguments
- data
A data frame with the first column as sample IDs and remaining columns containing feature values, where feature IDs are the column names.
- data_meta_features
A data frame containing metadata for the features, including information such as limit of detection and missingness percentage.
- data_meta_samples
A data frame containing metadata for the samples, including information necessary for plate correction and case-control analysis.
- col_samples
A string specifying the column name in
datathat contains sample IDs (e.g.,"Idepic_Bio").- col_features
A string specifying the column name in
data_meta_featuresthat contains feature IDs, which should match the column names ofdata(e.g.,"UNIPROT").- save
A logical for whether you want to save the feature data, plots, and exclusion info. Default is
FALSE/.- path_out
A string specifying the output directory where the processed data will be saved.
- path_outliers
A string specifying the output directory where outlier information will be saved.
- exclusion_extreme_feature
A logical flag indicating whether to exclude features with extreme missingness. Default is
FALSE.- missing_pct_feature
A numeric value specifying the threshold percentage for missingness above which features will be excluded (e.g.,
0.9).- exclusion_extreme_sample
A logical flag indicating whether to exclude samples with extreme missingness. Default is
FALSE.- missing_pct_sample
A numeric value specifying the threshold percentage for missingness above which samples will be excluded (e.g.,
0.9).- imputation
A logical flag indicating whether imputation should be performed. Default is
FALSE.- imputation_method
A string specifying the method to use for imputation. Options include
"LOD","1/5th","KNN","PPCA","median","mean","RF", and"LCMD".- col_LOD
A string specifying the column name in
data_meta_featuresthat contains the limit of detection (LOD) values, required ifimputation_methodis"LOD".- transformation
A logical flag indicating whether transformation should be performed. Default is
FALSE.- transformation_method
A string specifying the method to use for transformation. Options include
"InvRank","Log10","Log10Capped", and"Log10ExclExtremes".- outlier
A logical flag indicating whether outlier exclusion should be performed across features and samples. Default is
FALSE.- plate_correction
A logical flag indicating whether plate correction should be performed. Default is
FALSE.- cols_listRandom
A string or vector specifying columns in
data_meta_samplesto be treated as random effects in plate correction (e.g.,"batch_plate").- cols_listFixedToKeep
A vector specifying columns in
data_meta_samplesto be treated as fixed effects in plate correction and retained in the model (e.g.,c("Center", "Country")).- cols_listFixedToRemove
A vector specifying columns in
data_meta_samplesto be treated as fixed effects in plate correction and removed from the model. Default isNULL.- col_HeteroSked
A string specifying the column in
data_meta_samplesto be used for heteroskedasticity correction.- centre_scale
A logical flag indicating whether to center and scale the data. Default is
FALSE.- case_control
A logical flag indicating whether the data are case-control and if matched samples should be handled accordingly. Default is
FALSE.- col_case_control
A string specifying the column name in
data_meta_samplesthat contains case-control matching information (e.g.,"Match_Caseset").
Details
This function performs several data processing steps (in order):
Excludes features with extreme missingness based on a specified threshold.
Excludes samples with extreme missingness based on a specified threshold.
Imputes missing values using various methods.
Transforms the data using specified methods.
Excludes outlying samples using PCA and LOF.
Handles case-control data to ensure matched samples are treated appropriately.
Corrects for plate effects using specified random and fixed effects.
Centers and scales the data if
centre_scaleisTRUE.