Cluster feature groups based on correlations
cluster_features_based_on_correlations.RdGiven a list of pre-defined feature groups, construct a data set of representative features by collapsing groups with high correlations into synthetic variables.
Arguments
- df
A tibble or data.frame containing the feature columns (and possibly other variables not used here). All variables referenced in
cluster_feature_listmust be columns ofdf.- cluster_feature_list
A list where each element is a character vector giving the names of the features that belong to one initial cluster. Feature names are typically of the form
"RT@mass", whereRTandmassare real numbers corresponding to retention time and mass, but any valid column names are accepted.- cut_height
Numeric. Cut-off height for the correlation-based clustering within groups of size three or more (as implemented in ClustOfVar). A suggested value is
0.26, which corresponds roughly to a pairwise correlation of about0.75. You can use a dedicated helper (e.g.simul_eval_link_hhh_corr) to explore the correspondence between pairwise correlations and the tree height returned byClustOfVar::hclustvar().- corr_thresh
Numeric. Cut-off for the absolute pairwise correlation used when the number of features in a group is exactly two. If the absolute correlation between the two features is at least
corr_thresh, they are replaced by a single synthetic feature (first principal component); otherwise both original features are kept.
Value
A list with the following components:
- clustered_df
A tibble with one row per observation and one column per final representative feature. For clusters of size one, the column is the original feature. For size two with correlation above
corr_thresh, the column is the first principal component of the two features. For larger clusters, columns are synthetic variables returned by ClustOfVar, oriented to have a positive correlation with the first feature of the corresponding sub-cluster.- representatives_map
A named list mapping each synthetic feature (columns whose names start with
"SynthFeat@") to the character vector of original feature names that it summarises. Original features that are kept as is do not appear in this map.
Details
For groups of size one, the original feature is kept as is. For groups of size two, the decision is based on their pairwise correlation. For larger groups, clustering is based on the decrease in homogeneity as in the ClustOfVar package. The homogeneity of a cluster is the squared correlation (for quantitative variables) between the variables and the cluster centre (the first principal component).
This function is typically used after a pre-grouping step (for example,
after retention-time-based grouping). Each element of
cluster_feature_list corresponds to one such pre-defined group and
may contain one, two, or more feature names; all cases are handled.
For groups of size three or more, a hierarchical clustering tree is first
obtained using ClustOfVar::hclustvar(). The tree is then cut at
a data-driven position derived from cut_height, and
ClustOfVar::cutreevar() is used to obtain synthetic scores for the
resulting sub-clusters. Each score is multiplied by \(+1\) or \(-1\)
so that it has a positive correlation with the first feature in its
sub-cluster.
Examples
set.seed(123)
data_tbl <- tibble::tibble(
`100@150` = rnorm(20),
`100@151` = rnorm(20),
`101@200` = rnorm(20)
)
# Pre-defined feature groups (e.g. after a retention-time grouping step)
cluster_feature_list <- list(
c("100@150", "100@151"), # group of size 2
"101@200" # group of size 1
)
res <- cluster_features_based_on_correlations(
df = data_tbl,
cluster_feature_list = cluster_feature_list,
cut_height = 0.26,
corr_thresh = 0.75
)
res$clustered_df
#> 100@150 100@151 101@200
#> 1 -0.56047565 -1.06782371 -0.69470698
#> 2 -0.23017749 -0.21797491 -0.20791728
#> 3 1.55870831 -1.02600445 -1.26539635
#> 4 0.07050839 -0.72889123 2.16895597
#> 5 0.12928774 -0.62503927 1.20796200
#> 6 1.71506499 -1.68669331 -1.12310858
#> 7 0.46091621 0.83778704 -0.40288484
#> 8 -1.26506123 0.15337312 -0.46665535
#> 9 -0.68685285 -1.13813694 0.77996512
#> 10 -0.44566197 1.25381492 -0.08336907
#> 11 1.22408180 0.42646422 0.25331851
#> 12 0.35981383 -0.29507148 -0.02854676
#> 13 0.40077145 0.89512566 -0.04287046
#> 14 0.11068272 0.87813349 1.36860228
#> 15 -0.55584113 0.82158108 -0.22577099
#> 16 1.78691314 0.68864025 1.51647060
#> 17 0.49785048 0.55391765 -1.54875280
#> 18 -1.96661716 -0.06191171 0.58461375
#> 19 0.70135590 -0.30596266 0.12385424
#> 20 -0.47279141 -0.38047100 0.21594157
res$representatives_map
#> list()