Process feature data — process_data • OmicsProcessing

This function processes feature data given specified metadata. It supports exclusion of features with extreme missingness, various imputation methods, transformation, plate correction, centering, and case-control data handling.

Usage

process_data(
  data,
  data_meta_features = NULL,
  data_meta_samples = NULL,
  col_samples,
  col_features = NULL,
  save = FALSE,
  path_out = NULL,
  path_outliers = NULL,
  exclusion_extreme_feature = FALSE,
  missing_pct_feature = NULL,
  exclusion_extreme_sample = FALSE,
  missing_pct_sample = NULL,
  imputation = FALSE,
  imputation_method = NULL,
  col_LOD = NULL,
  transformation = FALSE,
  transformation_method = NULL,
  outlier = FALSE,
  plate_correction = FALSE,
  cols_listRandom = NULL,
  cols_listFixedToKeep = NULL,
  cols_listFixedToRemove = NULL,
  col_HeteroSked = NULL,
  centre_scale = FALSE,
  case_control = FALSE,
  col_case_control = NULL
)

Arguments

data: A data frame with the first column as sample IDs and remaining columns containing feature values, where feature IDs are the column names.
data_meta_features: A data frame containing metadata for the features, including information such as limit of detection and missingness percentage.
data_meta_samples: A data frame containing metadata for the samples, including information necessary for plate correction and case-control analysis.
col_samples: A string specifying the column name in data that contains sample IDs (e.g., "Idepic_Bio").
col_features: A string specifying the column name in data_meta_features that contains feature IDs, which should match the column names of data (e.g., "UNIPROT").
save: A logical for whether you want to save the feature data, plots, and exclusion info. Default is FALSE/.
path_out: A string specifying the output directory where the processed data will be saved.
path_outliers: A string specifying the output directory where outlier information will be saved.
exclusion_extreme_feature: A logical flag indicating whether to exclude features with extreme missingness. Default is FALSE.
missing_pct_feature: A numeric value specifying the threshold percentage for missingness above which features will be excluded (e.g., 0.9).
exclusion_extreme_sample: A logical flag indicating whether to exclude samples with extreme missingness. Default is FALSE.
missing_pct_sample: A numeric value specifying the threshold percentage for missingness above which samples will be excluded (e.g., 0.9).
imputation: A logical flag indicating whether imputation should be performed. Default is FALSE.
imputation_method: A string specifying the method to use for imputation. Options include "LOD", "1/5th", "KNN", "PPCA", "median", "mean", "RF", and "LCMD".
col_LOD: A string specifying the column name in data_meta_features that contains the limit of detection (LOD) values, required if imputation_method is "LOD".
transformation: A logical flag indicating whether transformation should be performed. Default is FALSE.
transformation_method: A string specifying the method to use for transformation. Options include "InvRank", "Log10", "Log10Capped", and "Log10ExclExtremes".
outlier: A logical flag indicating whether outlier exclusion should be performed across features and samples. Default is FALSE.
plate_correction: A logical flag indicating whether plate correction should be performed. Default is FALSE.
cols_listRandom: A string or vector specifying columns in data_meta_samples to be treated as random effects in plate correction (e.g., "batch_plate").
cols_listFixedToKeep: A vector specifying columns in data_meta_samples to be treated as fixed effects in plate correction and retained in the model (e.g., c("Center", "Country")).
cols_listFixedToRemove: A vector specifying columns in data_meta_samples to be treated as fixed effects in plate correction and removed from the model. Default is NULL.
col_HeteroSked: A string specifying the column in data_meta_samples to be used for heteroskedasticity correction.
centre_scale: A logical flag indicating whether to center and scale the data. Default is FALSE.
case_control: A logical flag indicating whether the data are case-control and if matched samples should be handled accordingly. Default is FALSE.
col_case_control: A string specifying the column name in data_meta_samples that contains case-control matching information (e.g., "Match_Caseset").

Value

The processed data is returned and saved as a .rds file to the specified output directory.

Details

This function performs several data processing steps (in order):

Excludes features with extreme missingness based on a specified threshold.
Excludes samples with extreme missingness based on a specified threshold.
Imputes missing values using various methods.
Transforms the data using specified methods.
Excludes outlying samples using PCA and LOF.
Handles case-control data to ensure matched samples are treated appropriately.
Corrects for plate effects using specified random and fixed effects.
Centers and scales the data if centre_scale is TRUE.