Integration with patpy and evaluating patient maps
Using MINA representations with patpy¶
MINA models generate sample-level representations, such as patient-by-factor matrices, that can be useful beyond the MINA workflow itself. The patpy integration provides a small adapter class that wraps a precomputed sample representation so it can be used with patpy downstream tools.
patpy is an scverse python package for construction and evaluation of patient representations for single-cell data. It does contain alternatives taylored to specific biological questions.
This is useful when you have already computed a patient map or latent representation in MINA and want to reuse it in patpy for distance calculation, embedding, clustering, plotting, or other sample-level analyses.
Installation¶
The patpy integration is optional. It is not installed with the base MINA package.
# patpy dependencies
import ehrapy as ep # Like scanpy, but for clinical data analysis. Convenient to preprocess the sample-level data
import scanpy as sc
import patpy # Datasets, metrics, and analysis of sample representations
import os # For reading environment variables (e.g. PATPY_GLOSCOPE_GPU)
import pandas as pd # Working with data frames
# Plotting
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import seaborn as sns
# For displaying benchmarking metrics
from plottable import ColumnDefinition, Table
from plottable.cmap import normed_cmap
from plottable.plots import bar
import time as _time # For measuring runtime of the methods
# Use a higher figure DPI for crisper plots in the rendered notebook,
# and remove the top/right spines from any seaborn scatter/bar plots.
matplotlib.rcParams['figure.dpi'] = 150
matplotlib.rcParams['savefig.dpi'] = 150
# mina needs
import mina
import scanpy.external as sce
import numpy as np
import decoupler as dc
import math
import mofaflex as mf
import mudata as md
# integration!!
from mina.integration.patpy import PrecomputedSampleRepresentation
We will showcase how to integrate MINA patient maps into patpy and evaluate distinct models against state of the art method GloScope.
We'll use the COMBAT dataset analysis (COvid-19 Multi-omics Blood ATlas (COMBAT) Consortium, Cell, 2022), provided by patpy. This dataset contains 783k peripheral blood mononuclear cells (PBMC) from 140 samples obtained from COVID-19, Flu, and Sepsis patients as well as from healthy donors.
adata = sc.read_h5ad("./data/combat_processed.h5ad")
adata.obs.rename(columns={"Annotation_major_subset": "cell_type"}, inplace=True)
sample_key = 'scRNASeq_sample_ID' # Set a column containing sample IDs from adata.obs for a custom dataset
cell_type_key = "cell_type"
sc.pl.umap(adata, color=["Source","Institute","Pool_ID", "cell_type"], frameon=False, ncols=2)
MINA processing steps¶
We process our data as we would do for any other dataset
# We extract the metadata based on our column that is the reference
# Filter adata to only include ileum
# Process adata_raw
metadata = mina.up.extract_metadata_from_obs(obs=adata.obs, groupby=sample_key, sort=False)
pdata = dc.pp.pseudobulk(adata, sample_col=sample_key, groups_col=cell_type_key,layer = "X_raw_counts")
# Manipulations on pdata
pdata.obs = pdata.obs.set_index(sample_key, drop=False)
pdata.obs.index.name = None
# Make list of adatas per pseudobulk cell type
anndata_dict = mina.up.split_anndata_by_celltype(pdata=pdata, grouping=cell_type_key)
# Adding gene expression total counts - TODO add to upstream as a function
for _cell_type, adata_raw in anndata_dict.items():
# Sum across observations (rows) for each gene (column)
if hasattr(adata_raw.X, "toarray"):
# Sparse matrix case
total_counts = adata_raw.X.sum(axis=0).A1 # returns 1D array
else:
total_counts = adata_raw.X.sum(axis=0) # numpy array
# Add to .var
adata_raw.var["total_counts"] = total_counts
# From all the possible samples, let's ask for at least 40%
n_samples = len(metadata.index)
min_samples = math.trunc(n_samples * 0.4)
# Do filters
mina.up.filter_anndata_by_ncells(anndata_dict, min_cells=10)
mina.up.filter_views_by_samples(anndata_dict, min_rows=min_samples)
mina.up.filter_genes_byexpr(anndata_dict, min_count=5, min_prop=0.4)
mina.up.filter_views_by_genes(anndata_dict, min_genes_per_view=100)
mina.up.filter_samples_by_coverage(anndata_dict, threshold=0, min_prop=0.97)
mina.up.filter_views_by_samples(anndata_dict, min_rows=min_samples)
mina.up.save_raw_counts(anndata_dict, layer_name="raw_counts")
mina.up.norm_log(anndata_dict, target_sum=1e6, exclude_highly_expressed=False, max_value=None, center=True)
mina.up.filter_hvgs(anndata_dict, groupby = None, ngroups_cut = 5)
mina.up.filter_smpls_by_nview(anndata_dict, min_views = 5)
mina.up.utils.append_view_to_var(anndata_dict)
Raw counts saved in the 'raw_counts' layer for each AnnData object. Normalization, log-transformation, and scaling complete for all AnnData objects with target_sum = 1000000.0.
Here we add the essential information to group the model
model_data = md.MuData(anndata_dict)
model_data.obs["study_group"] = model_data.obs.index.map(metadata["Pool_ID"])
simple_model = mf.terms.MofaFlex(n_factors=10, weight_prior="SpikeSlab", init_factors="pca") # type: ignore
simple_model.fit(model_data, seed=42, save_path=False,
lr=0.01, early_stopper_patience=1000, likelihoods="Normal",plot_data_overview=True)
2026-06-01 08:40:19 | [WARNING] Device cuda is not available. Using default device: cpu
2026-06-01 08:40:25 | [INFO] Initializing factors using 'pca' method... 0%| | 0/10000 [00:00<?, ?epoch/s]2026-06-01 08:40:25 | [INFO] Guessed max_plate_nesting = 2 57%|█████▋ | 5679/10000 [05:35<04:34, 15.77epoch/s, Loss=3.79e+3]2026-06-01 08:46:01 | [INFO] Training converged after 5679 epochs. 57%|█████▋ | 5679/10000 [05:36<04:15, 16.90epoch/s, Loss=3.79e+3]
In order to transform our mina model into a patpy compatible class, one needs to use PrecomputedSampleRepresentation adding the latent space, and the metadata.
a_simple_model = mina.down.model_to_anndata(
anndata_dict=anndata_dict,
metadata=metadata,
model=simple_model,
)
# Important to copy!! Otherwise, when we do the next step of filtering,
# it will mess with the original anndata that is used for the model and
# thus mess with the model itself
df = pd.DataFrame(a_simple_model.X.copy(),
index=a_simple_model.obs_names.copy(),
columns=a_simple_model.var_names.copy())
ppy_simple_model = PrecomputedSampleRepresentation(
representation=df,
metadata=metadata.copy(),
metric="euclidean",
)
Grouped MOFA¶
Taking PoolID as grouping variable
grouped_model = mf.terms.MofaFlex(n_factors=10, weight_prior="SpikeSlab", init_factors="pca") # type: ignore
grouped_model.fit(model_data, seed=42, save_path=False,
lr=0.01, early_stopper_patience=1000, likelihoods="Normal",
plot_data_overview=False, group_by="study_group")
2026-06-01 08:51:59 | [WARNING] Device cuda is not available. Using default device: cpu 2026-06-01 08:52:17 | [INFO] Initializing factors using 'pca' method... 0%| | 0/10000 [00:00<?, ?epoch/s]2026-06-01 08:52:19 | [INFO] Guessed max_plate_nesting = 2 42%|████▏ | 4235/10000 [06:59<08:53, 10.81epoch/s, Loss=3.54e+3]2026-06-01 08:59:18 | [INFO] Training converged after 4235 epochs. 42%|████▏ | 4235/10000 [06:59<09:30, 10.10epoch/s, Loss=3.53e+3]
a_grouped_model = mina.down.model_to_anndata(
anndata_dict=anndata_dict,
metadata=metadata,
model=grouped_model,
)
df = pd.DataFrame(a_grouped_model.X.copy(),
index=a_grouped_model.obs_names.copy(),
columns=a_grouped_model.var_names.copy())
ppy_grouped_model = PrecomputedSampleRepresentation(
representation=df,
metadata=metadata.copy(),
metric="euclidean",
)
Grouped MOFA with additional compositional information¶
We can in addition make a classic model with additional compositional information, this will capture two different aspects of tissue organization
from skbio.stats.composition import clr, multi_replace
from anndata import AnnData
def get_cell_count(adata, sample_key, cell_type_key):
# Compute cell type counts per sample
cell_counts = (
adata.obs
.groupby([sample_key, cell_type_key])
.size()
.unstack(fill_value=0)
)
# Convert counts to proportions
prop_data = cell_counts.div(cell_counts.sum(axis=1), axis=0)
# CLR transformation
prop_data_clr = clr(multi_replace(prop_data.values))
prop_data_clr = pd.DataFrame(prop_data_clr, index=prop_data.index, columns=prop_data.columns)
clr_props_adata = AnnData(prop_data_clr)
clr_props_adata.obs[sample_key] = clr_props_adata.obs_names
clr_props_adata.var["compositions"] = clr_props_adata.var_names
return(clr_props_adata)
First we infer the compositional information from the adata and add it to our dictionary of views
clr_props_adata = get_cell_count(adata, sample_key, cell_type_key)
mina.up.utils.append_view_to_var({"clr_comps":clr_props_adata})
anndata_dict["clr_comps"] = clr_props_adata
model_data = md.MuData(anndata_dict)
model_data.obs["study_group"] = model_data.obs.index.map(metadata["Pool_ID"])
groupedclr_model = mf.terms.MofaFlex(n_factors=10, weight_prior="SpikeSlab", init_factors="pca") # type: ignore
groupedclr_model.fit(model_data, seed=42, save_path=False,
lr=0.01, early_stopper_patience=1000, likelihoods="Normal",
plot_data_overview=False, group_by="study_group")
2026-06-01 09:01:19 | [WARNING] Device cuda is not available. Using default device: cpu 2026-06-01 09:01:41 | [INFO] Initializing factors using 'pca' method... 0%| | 0/10000 [00:00<?, ?epoch/s]2026-06-01 09:01:44 | [INFO] Guessed max_plate_nesting = 2 37%|███▋ | 3706/10000 [06:48<10:17, 10.20epoch/s, Loss=3.48e+3]2026-06-01 09:08:32 | [INFO] Training converged after 3707 epochs. 37%|███▋ | 3707/10000 [06:48<11:33, 9.07epoch/s, Loss=3.48e+3]
a_clrgrouped_model = mina.down.model_to_anndata(
anndata_dict=anndata_dict,
metadata=metadata,
model=groupedclr_model,
)
df = pd.DataFrame(a_clrgrouped_model.X.copy(),
index=a_clrgrouped_model.obs_names.copy(),
columns=a_clrgrouped_model.var_names.copy())
ppy_clrgrouped_model = PrecomputedSampleRepresentation(
representation=df,
metadata=metadata.copy(),
metric="euclidean",
)
Comparing with GloScope¶
To show different ways of comparing the models, we select GloScope, state of the art in pure patient representation.
gloscope = patpy.tl.GloScope_py(
sample_key=sample_key,
cell_group_key=cell_type_key,
layer="X_scVI_batch",
k=25,
use_gpu=False,
)
gloscope.prepare_anndata(adata)
D = gloscope.calculate_distance_matrix(force=True)
Adding all models into a "single representation" for evaluation¶
Although not the most elegant solution, in order to be able to take advantage of patpy evaluation frameworks is to add all results into a single anndata. We will use the anndata generated from MINA to ensure that all methods, include the relevant samples
import warnings
def store_representation(meta_adata, method, method_name):
"""
Store a sample representation in meta_adata after subsetting to samples
available both in meta_adata.obs_names and method.samples.
The output AnnData is ordered according to meta_adata.obs_names.
"""
representation_samples = pd.Index(method.samples)
meta_samples = pd.Index(meta_adata.obs_names)
if representation_samples.has_duplicates:
duplicated = representation_samples[representation_samples.duplicated()].unique().tolist()
raise ValueError(f"`method.samples` contains duplicated sample IDs: {duplicated}")
# Keep only samples present in both objects, preserving meta_adata.obs_names order
common_samples = meta_samples.intersection(representation_samples, sort=False)
if len(common_samples) == 0:
raise ValueError("No overlapping samples between `meta_adata.obs_names` and `method.samples`.")
missing_from_representation = meta_samples.difference(representation_samples)
if len(missing_from_representation) > 0:
warnings.warn(
f"{len(missing_from_representation)} samples in `meta_adata.obs_names` are not present "
f"in `method.samples` and will be removed.",
UserWarning,
)
# Subset meta_adata to available samples
meta_adata = meta_adata[common_samples].copy()
# Integer positions of common samples in method output
sample_idx = representation_samples.get_indexer(common_samples)
sample_representation = np.asarray(method.sample_representation)
distance_matrix = np.asarray(method.calculate_distance_matrix())
umap_embedding = np.asarray(method.embed("UMAP"))
if sample_representation.shape[0] != len(representation_samples):
raise ValueError(
"`method.sample_representation` rows do not match the number of `method.samples`."
)
if distance_matrix.shape != (len(representation_samples), len(representation_samples)):
raise ValueError(
"`method.calculate_distance_matrix()` must return a square matrix with one row/column "
"per sample in `method.samples`."
)
if umap_embedding.shape[0] != len(representation_samples):
raise ValueError(
"`method.embed('UMAP')` rows do not match the number of `method.samples`."
)
meta_adata.obsm[method_name] = sample_representation[sample_idx, :]
meta_adata.obsm[f"{method_name}_distances"] = distance_matrix[sample_idx][:, sample_idx]
meta_adata.obsm[f"X_umap_{method_name}"] = umap_embedding[sample_idx, :]
sc.pp.neighbors(
meta_adata,
use_rep=f"{method_name}_distances",
key_added=f"{method_name}_neighbors",
metric="precomputed",
)
if "sample_representations" not in meta_adata.uns:
meta_adata.uns["sample_representations"] = []
if method_name not in meta_adata.uns["sample_representations"]:
meta_adata.uns["sample_representations"].append(method_name)
return meta_adata
meta_adata = a_simple_model.copy()
meta_adata = store_representation(meta_adata, ppy_simple_model, "MOFA_vanilla")
meta_adata = store_representation(meta_adata, gloscope, "GloScope")
meta_adata = store_representation(meta_adata, ppy_grouped_model, "MOFA_grouped")
meta_adata = store_representation(meta_adata, ppy_clrgrouped_model, "MOFA_grouped_clr")
Benchmark metrics¶
The benchmark evaluates sample representations by asking whether distances between samples preserve selected metadata properties.
- Information retention measures whether clinical or biological covariates can be recovered from the nearest neighbours of each sample in the representation. For categorical variables, the predicted label is obtained by majority vote among nearest neighbours and scored with a corrected macro-F1 score; for numerical or ordinal variables, neighbour-averaged values are compared with the true values using Spearman correlation.
- Batch removal uses the same logic but applies it to technical covariates, defining a good representation as one where batch variables are difficult to recover.
benchmark_schema = {
"technical": {
"Institute": "classification",
"Pool_ID": "classification",
},
"clinical": {
"Death28": "classification",
"Outcome": "ranking",
"Source": "classification",
"SARSCoV2PCR": "classification",
},
}
results = patpy.tl.evaluation.knn_prediction_score(
meta_adata,
benchmark_schema,
representations=meta_adata.uns["sample_representations"],
n_neighbors=5,
reverse_technical_score=True, # To make higher = better
)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def plot_metric_heatmap_by_covariate_type(
df,
score_col="score",
method_col="representation",
metric_col="covariate",
covariate_type_col="covariate_type",
aggfunc="mean",
cmap="viridis",
figsize=None,
annotate=True,
annot_fmt=".2f",
annot_fontsize=8,
):
"""
Heatmap with:
- rows: method / representation
- columns: metric / covariate, grouped by covariate_type
- values: score
"""
heatmap_df = (
df.pivot_table(
index=method_col,
columns=[covariate_type_col, metric_col],
values=score_col,
aggfunc=aggfunc,
)
.sort_index(axis=0)
.sort_index(axis=1)
)
if figsize is None:
figsize = (
max(6, 0.45 * heatmap_df.shape[1]),
max(3, 0.35 * heatmap_df.shape[0]),
)
fig, ax = plt.subplots(figsize=figsize)
values = heatmap_df.values
im = ax.imshow(
values,
aspect="auto",
cmap=cmap,
)
# Add text inside cells
if annotate:
vmin = np.nanmin(values)
vmax = np.nanmax(values)
threshold = vmin + 0.5 * (vmax - vmin)
for i in range(heatmap_df.shape[0]):
for j in range(heatmap_df.shape[1]):
value = heatmap_df.iloc[i, j]
if pd.notna(value):
text_color = "white" if value < threshold else "black"
ax.text(
j,
i,
f"{value:{annot_fmt}}",
ha="center",
va="center",
fontsize=annot_fontsize,
color=text_color,
)
# Row labels
ax.set_yticks(np.arange(heatmap_df.shape[0]))
ax.set_yticklabels(heatmap_df.index)
# Column labels: show only metric/covariate names
metric_labels = heatmap_df.columns.get_level_values(metric_col)
ax.set_xticks(np.arange(heatmap_df.shape[1]))
ax.set_xticklabels(metric_labels, rotation=90)
# Add covariate_type group labels and separators
covariate_types = heatmap_df.columns.get_level_values(covariate_type_col)
boundaries = []
group_centers = []
for cov_type in covariate_types.unique():
positions = np.where(covariate_types == cov_type)[0]
center = (positions[0] + positions[-1]) / 2
group_centers.append((center, cov_type))
end = positions[-1]
if end < heatmap_df.shape[1] - 1:
boundaries.append(end + 0.5)
for boundary in boundaries:
ax.axvline(boundary, color="white", linewidth=2)
# Covariate type labels above the heatmap
for center, cov_type in group_centers:
ax.text(
center,
-1.2,
cov_type,
ha="center",
va="bottom",
fontsize=10,
fontweight="bold",
)
ax.set_xlabel(metric_col)
ax.set_ylabel(method_col)
cbar = fig.colorbar(im, ax=ax)
cbar.set_label(score_col)
plt.tight_layout()
return fig, ax, heatmap_df
fig, ax, heatmap_df = plot_metric_heatmap_by_covariate_type(results)
In the plot above, individual values represent the evaluation of the recovery of a covariate of interest by the latent space. In practical terms, the patient maps do capture the variability of patients across distinct relevant features with similar trends, as mentioned by the original GloScope paper. We would like to highlight other differences between the models.
Input and modeling assumptions¶
| Tool | Statistical or ML principle | Data type | Flexibility of input | Missing data | Batch correction |
|---|---|---|---|---|---|
| MINA | Grouped factor analysis with structured priors through MOFA-FLEX. | Multiview patient representations, i.e. collections of patients × features matrices. | By default, pseudobulk gene expression data for each cell type of interest. However, cellular composition, spatial organization, and other patient-level representations can also be included in the model. | Patients can miss complete views or individual features within views. | Explicitly managed in the model. |
| GloScope | Represents each biological sample as an empirical probability distribution of its cells in a shared low-dimensional cell-state space, then compares samples using pairwise statistical divergence between these estimated distributions. | Low-dimensional representation of single-cell data, such as PCA or scVI. | Any low-dimensional representation can be used. Joint multimodal latent spaces from the same samples could also be included. | Missingness is handled in the low-dimensional representation. However, it is treated as a characteristic of the patient/sample distribution. | Implicitly managed within the low-dimensional representation. |
Output and interpretability¶
| Tool | Latent space | Recovered variance | Interpretability of latent space | Pairwise distances |
|---|---|---|---|---|
| MINA | Yes. | Yes, both at the level of latent variables and views. | Yes, both at the level of features and views. | Yes, if calculated from the latent variables. |
| GloScope | Yes, built from the pairwise distance matrix, for example with multidimensional scaling. | No. | No. | Yes. |
Some comments regarding the metrics¶
Evaluation based on information retention should be interpreted with caution because it measures how well selected annotations can be recovered from local neighbourhoods in the full representation, rather than the intrinsic quality of the patient map. In other words, the metric asks whether samples that are close in the latent space also share known clinical or technical covariates.
This is useful as a diagnostic, but it should not be treated as a general measure of representation quality. In the benchmark results, the tested methods recover several clinical and technical covariates at broadly similar levels, suggesting that small numerical differences are unlikely to provide a robust basis for method selection. More importantly, these scores can disadvantage richer representations. If only a subset of latent variables captures the available clinical metadata, while other axes capture orthogonal biological variation, the full-space neighbourhood structure may no longer be dominated by the annotated covariates. A lower information-retention score may therefore reflect the presence of additional, unannotated structure rather than poorer representation quality.
For this reason, we treat these metrics as annotation-alignment diagnostics. We then complement them with criteria that evaluate properties of the patient map itself, including explained variance, interpretability, and the biological coherence of the recovered latent factors.
Although it is reasonable to expect measured clinical variables to be reflected in molecular profiles, these variables are unlikely to explain all relevant gene expression variability. In fact, one of the motivations for generating molecular profiles is to capture disease-relevant structure that is not fully represented by standard clinical annotations. The following example illustrates this point.
MOFA grouped with compositions¶
We reconstructed multicellular gene expression and cellular composition using a grouped MOFA model with 10 factors. However, not all factors were associated with the clinical or technical variables included in the benchmark. In particular, Factors 6, 7, and 8 did not show clear associations with the selected variables of interest. This raises a practical question: if we remove these factors from the representation, do the benchmark results change?
# Get the p-value matrix for the associations of interest (adjusted)
all_assocs = mina.down.get_pval_matrix(a_clrgrouped_model,["Institute", "Pool_ID", "Death28", "Outcome", "Source", "SARSCoV2PCR"])
mina.pl.pl.plot_pval_tiles(all_assocs, title="10 factors - P-value matrix",star_threshold=0.05)
plt.show()
fs = ["Factor" + str(i) for i in range(a_clrgrouped_model.X.shape[1]+1) if i not in [0,6,7,8]]
red_model = a_clrgrouped_model[:,fs].copy()
df = pd.DataFrame(red_model.X.copy(),
index=red_model.obs_names.copy(),
columns=red_model.var_names.copy())
ppy_red_model = PrecomputedSampleRepresentation(
representation=df,
metadata=metadata.copy(),
metric="euclidean",
)
meta_adata = store_representation(meta_adata, ppy_red_model, "MOFA_grouped_clr_red")
results = patpy.tl.evaluation.knn_prediction_score(
meta_adata,
benchmark_schema,
representations=meta_adata.uns["sample_representations"],
n_neighbors=5,
reverse_technical_score=True, # To make higher = better
)
fig, ax, heatmap_df = plot_metric_heatmap_by_covariate_type(results)
Compared with the original grouped model with compositions, the reduced representation improves prediction of death after 28 days (Death28) and performs better than GloScope for the remaining benchmarked covariates, although the absolute differences between methods are small and should be interpreted cautiously. This result suggests that removing factors not aligned with the selected metadata can improve the information-retention score.
However, this improvement comes at a cost. To quantify this, we examined the explained variance captured by each factor across views and groups. When inspecting the top view–factor pairs, it becomes clear that excluding Factors 6, 7, and 8 removes a substantial fraction of structured variability from some pools and cell types, reaching up to 50% in specific cases. Two of these factors appear to be associated with Institute, suggesting that part of this variability may reflect technical or cohort-related structure. However, one factor is not explained by the available covariates and appears to account for a large fraction of the excluded variability.
a_clrgrouped_model[:,["Factor6", "Factor7", "Factor8"]].var.sum(0).sort_values(ascending=False).head(10)
PB:gPlexE 0.521059 PB:gPlexJ 0.367761 PB:gPlexA 0.359520 MAIT:gPlexB 0.343469 PB:gPlexB 0.341387 PB:gPlexD 0.283954 PB:gPlexH 0.269391 PB:gPlexG 0.257348 DC:gPlexD 0.256202 PB:gPlexF 0.214500 dtype: float64
a_clrgrouped_model[:,["Factor6"]].var.sum(0).sort_values(ascending=False).head(10)
PB:gPlexE 0.516666 PB:gPlexA 0.353326 PB:gPlexJ 0.344216 PB:gPlexB 0.333276 PB:gPlexD 0.275638 PB:gPlexG 0.234754 PB:gPlexH 0.230460 PB:gPlexF 0.209301 MAIT:gPlexB 0.150717 DP:gPlexK 0.140664 dtype: float64
This example highlights the limitation of relying exclusively on metadata-recovery metrics. A representation can score better after removing latent axes that are not aligned with the benchmark annotations, even when those axes explain substantial structure in the data. The relevant question is therefore not only whether a patient map recovers known clinical variables, but also whether it captures interpretable and biologically meaningful variation beyond those variables.
Thus, richer representations should not be penalized solely because they capture orthogonal variability. Instead, this additional structure should be evaluated directly through explained variance, factor interpretability, association with technical covariates, and biological coherence.