Feature clustering by retention time
Cluster features by retention time
cluster_features_by_retention_time() groups features
that elute together, based on similar retention times (RTs). It returns
a data frame that includes all unclustered features
unchanged, plus one representative per
cluster. The representatives_map output shows
which original features each representative corresponds to. See the
function reference for further details: cluster_features_by_retention_time().
Prerequisites: use imputed and normalised data before applying RT clustering. This procedure assumes no missing values and benefits from stabilised intensity profiles.
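The no-missing-values prerequisite can be checked before clustering. The following is a minimal base R sketch on toy data. The data frame, the `mz@rt`-style column names, and the use of `grep()` to find feature columns are assumptions made for illustration, not part of the package API.

```r
# Toy data: 4 samples x 3 feature columns (hypothetical names)
df <- data.frame(
  sample_id = paste0("s", 1:4),
  `100.05@1.20` = c(10, 12, 11, 13),
  `200.10@1.22` = c(5, NA, 6, 7),
  `300.20@3.40` = c(8, 9, 7, 8),
  check.names = FALSE
)

# Assumption: feature columns are those whose names contain "@"
target_cols <- grep("@", names(df), value = TRUE, fixed = TRUE)

# Count missing values per feature column
na_counts <- colSums(is.na(df[, target_cols, drop = FALSE]))
features_with_na <- names(na_counts)[na_counts > 0]

if (length(features_with_na) > 0) {
  message(
    "Impute before clustering; missing values in: ",
    paste(features_with_na, collapse = ", ")
  )
}
```

If `features_with_na` is non-empty, the data should go back through imputation before RT clustering is attempted.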
The is_qc argument prevents QC-designated rows from
contributing to the clustering. Typically, is_qc should be
set to NULL so that all samples contribute.
How representatives are selected with the “scores” method
The “scores” method clusters features by RT and returns one
representative per RT cluster based on the input score. It
employs get_features_representatives_based_on_scores().
This helper is called internally by
cluster_features_by_retention_time(); you do not need to
invoke it explicitly.
For each RT-based cluster:
- Single-feature clusters: the feature is returned unchanged.
- Multi-feature clusters: the feature with the highest
score is selected as the representative, and
representatives_map records all members that it represents.
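The selection rule above can be sketched in base R. This is an illustration of the idea only, not the internals of get_features_representatives_based_on_scores(); the cluster assignments, feature names, and scores are hypothetical.

```r
# Hypothetical cluster assignment: feature name -> cluster id
clusters <- c(f1 = 1, f2 = 1, f3 = 2, f4 = 3, f5 = 3, f6 = 3)
# Hypothetical scores (e.g. mean intensities), named by feature
scores <- c(f1 = 10, f2 = 25, f3 = 7, f4 = 3, f5 = 9, f6 = 4)

# Split feature names by cluster
members_by_cluster <- split(names(clusters), clusters)

# Within each cluster, pick the member with the highest score;
# a single-feature cluster simply returns its only member
representatives <- vapply(
  members_by_cluster,
  function(members) members[which.max(scores[members])],
  character(1)
)

# Map each representative to the members it stands for
representatives_map <- setNames(members_by_cluster, unname(representatives))
```

In this toy case `f2` represents `f1` and `f2`, `f3` represents only itself, and `f5` represents `f4`, `f5`, and `f6`.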
Example 1: Score-based representatives (mean intensity before normalisation)
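# Assume `normalised_df` is your imputed, SERRF-normalised data and
# `pre_normalised_df` holds the same samples and features before normalisation
# Use pre-normalisation means as scores (a named numeric vector, one value per feature)
target_cols <- OmicsProcessing::resolve_target_cols(normalised_df, "@")
scores <- colMeans(pre_normalised_df[, target_cols])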
res_scores <- OmicsProcessing::cluster_features_by_retention_time(
df = normalised_df,
target_cols = target_cols,
rt_height = 0.07,
method = "scores",
corr_thresh = 0.75,
scores = scores
)
clustered_df <- res_scores$clustered_df # original features + representatives
rep_map <- res_scores$representatives_map # mapping of representatives to raw features
The rt_height parameter defines the maximum RT span for
forming a cluster: features whose RTs differ by less than this value are
grouped. Within each RT‑based group, pairwise correlations between
features are calculated, and the features are further partitioned into
correlation‑based subclusters. The supplied scores
determine the representative feature for each subcluster. In this
example, pre-normalisation mean intensities serve as scores;
consequently, each representative is the feature with the highest mean
intensity before normalisation.
The representatives_map lists representatives as names,
with vectors of raw feature names as their values. In effect, features
with similar retention times are identified, and among those with high
correlation, the feature with the highest pre‑normalisation mean
intensity is selected as the representative.
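To build intuition for rt_height, the RT grouping step can be imitated with hierarchical clustering on retention times. The sketch below assumes complete linkage and a dendrogram cut at rt_height; the linkage actually used by the package is not stated here, and the feature names and RT values are hypothetical.

```r
# Hypothetical retention times (minutes), named by feature
rt <- c(f1 = 1.20, f2 = 1.22, f3 = 1.25, f4 = 3.40, f5 = 3.60)
rt_height <- 0.07

# Complete linkage: cutting at height h means no two features in the
# same cluster differ in RT by more than h
tree <- hclust(dist(rt), method = "complete")
rt_cluster <- cutree(tree, h = rt_height)

# Feature names grouped by RT cluster
rt_groups <- split(names(rt_cluster), rt_cluster)
```

Here `f1`, `f2`, and `f3` fall into one group because they span 0.05 minutes, while `f4` and `f5` are 0.20 minutes apart and stay separate. Lowering `rt_height` to 0.04 would split the first group, since `f1` and `f3` differ by 0.05.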
How representatives are selected with the correlation method
With method = "correlations", features are first
grouped by RT using rt_height, as in the scores method.
A secondary correlation-based clustering step is then applied within
each RT-defined group. Clustering behaviour is governed by
corr_thresh and cut_height:
- RT cluster of size 1: retained unchanged.
- RT cluster of size 2:
  - If |corr| ≥ corr_thresh, the pair is summarised as a synthetic PC1 feature.
  - Otherwise, both features are retained.
- RT cluster of size ≥3: hierarchical clustering (ClustOfVar) is cut at cut_height, and each resulting sub-cluster is summarised as a synthetic PC1, aligned to the first feature in that sub-cluster.
In all cases, the representatives_map lists synthetic
feature names and records the raw features incorporated into each
synthetic representative.
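The size-2 case can be illustrated with base R. This sketch uses prcomp() on a simulated pair and reads "aligned to the first feature" as flipping the sign of PC1 so that it correlates positively with that feature; both the scaling choice and this reading of the alignment are assumptions, and the package may compute its synthetic feature differently.

```r
set.seed(1)
n <- 30
base_signal <- rnorm(n)

# Two hypothetical co-eluting features that track the same signal
x <- cbind(
  f1 = base_signal + rnorm(n, sd = 0.2),
  f2 = 2 * base_signal + rnorm(n, sd = 0.2)
)
corr_thresh <- 0.75

pair_corr <- cor(x[, "f1"], x[, "f2"])

if (abs(pair_corr) >= corr_thresh) {
  # First principal component of the centred, scaled pair
  pc1 <- prcomp(x, center = TRUE, scale. = TRUE)$x[, 1]
  # PCA signs are arbitrary: flip so PC1 correlates positively with f1
  if (cor(pc1, x[, "f1"]) < 0) pc1 <- -pc1
  summary_feature <- pc1
} else {
  # Below the threshold both features would be kept as they are
  summary_feature <- NULL
}
```

The resulting `summary_feature` has one value per sample and would stand in for both `f1` and `f2` in the clustered data.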
Example 2: Correlation-based summarisation within RT clusters
res_corr <- OmicsProcessing::cluster_features_by_retention_time(
df = normalised_df,
target_cols = "@",
is_qc = grepl("^sQC", normalised_df$sample_type),
rt_height = 0.07,
method = "correlations",
cut_height = 0.26,
corr_thresh = 0.75
)
clustered_df_corr <- res_corr$clustered_df
rep_map_corr <- res_corr$representatives_map
Tips
- Set rt_height to reflect the instrument’s RT precision; smaller values produce more, and tighter, RT clusters.
- For "scores", ensure that scores is a named numeric vector aligned with target_cols. Using pre-normalisation means is a simple and efficient strategy.
- For "correlations", tune cut_height and corr_thresh to adjust the merging stringency for correlated features.
- Always inspect the representatives_map to understand how each representative relates to original features, particularly for reporting, audit trails, or sensitivity analyses.
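Two of these tips lend themselves to quick checks in base R. The sketch below first aligns a named scores vector with target_cols, then summarises a map of the shape described in this article (representative names mapped to vectors of raw feature names). All object contents are hypothetical toy values.

```r
# 1. Align a named scores vector with the target columns
target_cols <- c("f1", "f2", "f3")
scores <- c(f3 = 7, f1 = 10, f2 = 25)
stopifnot(is.numeric(scores), all(target_cols %in% names(scores)))
scores <- scores[target_cols]  # reorder to match target_cols

# 2. Inspect a representatives map (hypothetical content)
rep_map <- list(
  f2 = c("f1", "f2"),
  f3 = "f3",
  f5 = c("f4", "f5", "f6")
)

# Number of raw features behind each representative
cluster_sizes <- lengths(rep_map)

# Long table: one row per (raw feature, representative) pair
map_long <- stack(rep_map)
names(map_long) <- c("raw_feature", "representative")
map_long$representative <- as.character(map_long$representative)

# Representatives that stand for more than one raw feature
merged <- names(cluster_sizes)[cluster_sizes > 1]
```

A long table such as `map_long` is convenient to export alongside results, since it records which raw features were absorbed into each representative.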