Multicellular Factor Analysis guided by pathway annotations
Multicellular factor analysis informed by biological prior knowledge with MuVI¶
NOTE
This notebook assumes familiarity with multicellular factor analysis and pathway activity inference.
In previous vignettes, we have shown how multicellular coordination can be inferred from pathway activities directly. However, this assumes that your prior knowledge captures all the variability you care about. This has certain limitations:
- gene sets used to infer activities are not independent, duplicating information
- gene sets are not context specific
What if you would like to use biological prior knowledge to inform factor reconstruction, but at the same time let the model find other sources of variability? Here, we show how to build these type of models with MuVI (Multi-View latent variable modeling with domain-informed priors) .
Core idea¶
MUVI extends multi-view factor analysis by incorporating prior knowledge directly into the factor loadings, enabling the model to align latent factors with predefined feature sets such as pathways.
In this setting:
- each view represents a cell type or compartment
- features correspond to genes
- prior knowledge defines relationships between genes, pathways and latent factors
The model learns:
- guiding latent factors informed by pathway annotations (which can be updated). This results in factors that are directly interpretable as pathway-level programs, rather than requiring post hoc enrichment analysis.
- unguided latent factors capturing coordinated variation across samples and views not dependent on annotations
Why MUVI for pathway-based analysis?¶
Unlike standard factor models, MUVI introduces structured sparsity and domain-informed priors on the loadings:
- factors are encouraged to align with predefined pathways
- irrelevant features are shrunk toward zero
- prior knowledge is used to guide interpretation, while remaining robust to noise or misspecification
This provides:
- improved interpretability of latent factors
- more stable recovery of biological signals
- reduced need for downstream enrichment steps
A limitation of this model, however, is that gene sets can only contain features changing in the same direction. Meaning, that weighted gene sets can't be directly used.
Relation to standard multicellular factor analysis¶
Conceptually, this framework follows the same structure as multicellular factor analysis:
- samples are represented across multiple views (cell types)
- variation is decomposed into latent factors
However, MUVI introduces an additional layer:
- priors on loadings link factors to pathways
- factors are therefore annotated during model training, not after
This shifts the interpretation from:
- gene-level variation ā post hoc pathway enrichment
to:
- pathway-level variation ā directly encoded in the model
import numpy as np
import pandas as pd
import anndata as ad
import decoupler as dc
import matplotlib.pyplot as plt
import seaborn as sns
import math
import scanpy as sc
import mudata as md
import mofaflex as mf
import os
import mina
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
Importing the dtw module. When using in academic works please cite: T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package. J. Stat. Soft., doi:10.18637/jss.v031.i07.
Application to PBMC data¶
To illustrate the use of pathway information for multicellular modeling, here we apply MINA to a dataset of peripheral blood mononuclear cells (PBMCs) of eight individuals. In this dataset cells were sequenced before and after treatment with interferon beta (Kang, et al., 2018).
Interferon beta is a type I interferon that signals through the Janus kinase/signal transducer and activator of transcription (JAK-STAT) pathway to trigger antiviral, antiproliferative, and immunomodulatory effects.
In PROGENy we have the gene expression changes caused by the stimulation of JAK-STAT, thus it is safe to assume that if we build a multicellular factor analysis guided by the knowledge encoded in these pathways, we should be able to recover factors encoding JAK-STAT that at the same time capture stimulation.
The code below shows the usual processing steps needed for this analysis
adata = ad.read_h5ad("data/kang.h5ad")
# filter cells and genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sample_key = 'replicate'
condition_key = 'label'
groupby = 'cell_type'
# create biosample_id by merging sample_key and condition_key
adata.obs[sample_key] = adata.obs[sample_key].astype(str)
adata.obs[condition_key] = adata.obs[condition_key].astype(str)
adata.obs["biosample_id"] = adata.obs[sample_key] + "_" + adata.obs[condition_key]
sc.pl.umap(adata, color=[condition_key, sample_key, groupby], frameon=False, ncols=2)
... storing 'label' as categorical ... storing 'replicate' as categorical ... storing 'biosample_id' as categorical
metadata = mina.up.pp.extract_metadata_from_obs(obs = adata.obs,
groupby= 'biosample_id',
sort= False)
pdata = dc.pp.pseudobulk(adata, sample_col='biosample_id', groups_col="cell_type")
# Manipulations on pdata
pdata.obs = pdata.obs.set_index('biosample_id', drop=False)
pdata.obs.index.name = None
anndata_dict = mina.up.split_anndata_by_celltype(pdata=pdata, grouping="cell_type")
# Adding gene expression total counts - TODO add to upstream as a function
for cell_type, adata in anndata_dict.items():
# Sum across observations (rows) for each gene (column)
if hasattr(adata.X, "toarray"):
# Sparse matrix case
total_counts = adata.X.sum(axis=0).A1 # returns 1D array
else:
total_counts = adata.X.sum(axis=0) # numpy array
# Add to .var
adata.var['total_counts'] = total_counts
mina.up.filt.filter_anndata_by_ncells(anndata_dict, min_cells=10)
# From all the possible samples, let's ask for at least 40%
n_samples = len(metadata.index)
min_samples = math.trunc((n_samples * 0.4))
mina.up.filter_views_by_samples(anndata_dict, min_rows=min_samples)
mina.up.filter_genes_byexpr(anndata_dict, min_count=5, min_prop=0.4)
mina.up.filter_views_by_genes(anndata_dict, min_genes_per_view=100)
mina.up.filter_samples_by_coverage(anndata_dict, threshold=0, min_prop=0.90)
mina.up.save_raw_counts(anndata_dict, layer_name="raw_counts")
mina.up.norm_log(anndata_dict, target_sum=1e6, exclude_highly_expressed=False, max_value=None, center=True)
Raw counts saved in the 'raw_counts' layer for each AnnData object. Normalization, log-transformation, and scaling complete for all AnnData objects with target_sum = 1000000.0.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
['patient_1015_ctrl', 'patient_1015_stim', 'patient_1016_ctrl', 'patient_1016_stim', 'patient_101_ctrl']
Inferred to be: categorical
Using decoupler + MuVI to run functional multicellular factor analysis¶
For this example, we will use the pathway information of Hypoxia and JAK-STAT in PROGENy. First, we pick all genes from which we know we should observe an upregulation upon pathway activation.
progeny = dc.op.progeny(organism="human", top=700)
#Genes upregulated in JAK-STAT pathway
progeny_up = progeny[progeny["weight"] > 0]
progeny_up = progeny_up[progeny_up["source"].isin(["JAK-STAT"])]
progeny_up["source"] = "up_" + progeny_up["source"]
#Genes downregulated in JAK-STAT pathway
progeny_down = progeny[progeny["weight"] < 0]
progeny_down = progeny_down[progeny_down["source"].isin(["JAK-STAT"])]
progeny_down["source"] = "down_" + progeny_down["source"]
#bindrows of progeny_up and progeny_down
progeny_up_down = pd.concat([progeny_up, progeny_down], axis=0)
In MINA, we have facilitated the encoding of prior knowledge in the multiview structure, by allowing users to use decoupler gene sets obtained from Omnipath.
One simply needs to loop over all views and provide the long dataframe with pathway annotations, this will create a membership matrix within each view. The function deals with the overlap of features between your data and the prior
for cell_type, adata in anndata_dict.items():
mina.up.make_membership_matrix(anndata_dict[cell_type], progeny_up_down, gene_col="target", pathway_col="source")
anndata_dict["B cells"].varm["pathway_membership"].head()
| up_JAK-STAT | down_JAK-STAT | |
|---|---|---|
| index | ||
| NOC2L | False | False |
| HES4 | False | False |
| ISG15 | True | False |
| TNFRSF18 | False | False |
| SDF4 | False | False |
Before fitting the model, we need to append the view to the features (no worries, this doesn't break your membership matrix).
mina.up.append_view_to_var(anndata_dict)
Fitting the model with MOFA-FLEX¶
Using MUVI requires defining:
- a prior matrix linking features (e.g. genes or pathways) to latent factors (encoded in our varm)
- hyperparameters controlling the strength of the prior (i.e. how strongly factors are encouraged to follow predefined structure)
- Working with positive-only feature loadings
- Working with positive-only
Note how we create ann_dict that tells the model which views to guide with which prior. Here, we use the same name but different feature set, because we expect not all genes across all views to be shared. But you can think of any combination of guided knowledge (e.g. one view only, distinct sets per view, etc.)
Here by default we generate a model with two guided factors (one Hypoxia, one JAK-STAT), and two unguided factors
ann_dict = {k: "pathway_membership" for k in anndata_dict}
mdata = md.MuData(anndata_dict)
model = mf.MOFAFLEX(
mdata,
mf.DataOptions(annotations_varm_key= ann_dict),
mf.ModelOptions(n_factors=2, weight_prior="Horseshoe", likelihoods="Normal",
nonnegative_weights=True,
nonnegative_factors=True,
annotation_confidence=0.999,
init_factors=0.0),
mf.TrainingOptions(seed=42, save_path=False, lr=0.01, early_stopper_patience=500)
)
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1403: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility. /Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1275: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility. 2026-03-31 15:16:23 | [WARNING] Device cuda is not available. Using default device: cpu
2026-03-31 15:16:24 | [WARNING] Could not import dask. Data arrays may be copied, resulting in high memory usage.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1403: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1275: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
['patient_1015_ctrl', 'patient_1015_stim', 'patient_1016_ctrl', 'patient_1016_stim', 'patient_101_ctrl']
Inferred to be: categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
['patient_1015_ctrl', 'patient_1015_stim', 'patient_1016_ctrl', 'patient_1016_stim', 'patient_101_ctrl']
Inferred to be: categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
['patient_1015_ctrl', 'patient_1015_stim', 'patient_1016_ctrl', 'patient_1016_stim', 'patient_101_ctrl']
Inferred to be: categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
['patient_1015_ctrl', 'patient_1015_stim', 'patient_1016_ctrl', 'patient_1016_stim', 'patient_101_ctrl']
Inferred to be: categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
['patient_1015_ctrl', 'patient_1015_stim', 'patient_1016_ctrl', 'patient_1016_stim', 'patient_101_ctrl']
Inferred to be: categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
['patient_1015_ctrl', 'patient_1015_stim', 'patient_1016_ctrl', 'patient_1016_stim', 'patient_101_ctrl']
Inferred to be: categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
['patient_1015_ctrl', 'patient_1015_stim', 'patient_1016_ctrl', 'patient_1016_stim', 'patient_101_ctrl']
Inferred to be: categorical
2026-03-31 15:16:24 | [INFO] Initializing factors using `0.0` method...
0%| | 0/10000 [00:00<?, ?epochs/s]2026-03-31 15:16:24 | [INFO] Guessed max_plate_nesting = 3
88%|āāāāāāāāā | 8849/10000 [09:31<01:17, 14.81epochs/s, Loss=4.64e+4]2026-03-31 15:25:56 | [INFO] Training converged after 8850 epochs.
88%|āāāāāāāāā | 8850/10000 [09:31<01:14, 15.48epochs/s, Loss=4.64e+4]
Outputs of a factor model¶
A factor model decomposes the data into latent factors (capturing sample-level variation) and loadings (capturing feature-level contributions), summarizing the main sources of variability.
This is unchanged from the basic workflow. Guided models produce the same outputs (factor scores, explained variance, and loadings), so all downstream functions in MINA can be applied in the same way.
amodel = mina.down.model_to_anndata(
anndata_dict=anndata_dict,
metadata=metadata,
model=model,
)
In the same way as we mentioned in previous vignettes. An important aspect of guided models is to verify the amount of explained variance associated with the guided factors
# Mean by rows
np.mean(amodel.var,axis=1).sort_values(ascending=False).head(10)
Factor2 0.615652 Factor1 0.431856 Factorup_JAK-STAT 0.034948 Factordown_JAK-STAT 0.002147 dtype: float32
A very limited variability is explained with our guided factors, so interpretation needs to be taken carefully
mf.pl.variance_explained(model)
Associations between pathway activity and stimulated samples¶
All downstream functions of MINA can be used as before. Let's test our hypothesis that the JAK-STAT factor should separate stimulated from non-stimulated samples
mina.down.get_associations(amodel,
test_variable = "label",
test_type="categorical",
random_effect = None)
| feature | statistic | p_value | adj_p_value | |
|---|---|---|---|---|
| 0 | Factor1 | 94.860275 | 1.293903e-07 | 1.725204e-07 |
| 1 | Factor2 | 396.476440 | 1.141794e-11 | 2.283588e-11 |
| 2 | Factorup_JAK-STAT | 615.277893 | 5.706848e-13 | 2.282739e-12 |
| 3 | Factordown_JAK-STAT | 63.080605 | 1.489637e-06 | 1.489637e-06 |
The "up" factor refers to genes upregulated upon pathway activation
sc.pl.violin(amodel,
["Factorup_JAK-STAT"],
groupby="label", rotation=90,size = 3)
Similarly, the "down" factor refers to genes upregulated in not-stimulated samples (downregulated during pathway activation)
sc.pl.violin(amodel,
["Factordown_JAK-STAT"],
groupby="label", rotation=90,size = 3)
As expected the JAK-STAT pathway is more active in the stimulated samples. However we also observe that the other two unguided factors also separate stimulated from non-stimulated samples.
sc.pl.scatter(amodel,
x = "Factor1",
y = "Factor2",
color = "label",
size =100)
Let's correlate the distinct factor scores
mf.pl.factor_correlation(model, figsize=(3,3))
Factor 1 and the guided JAK-STAT up factor highly correlate with each other, as well as the JAK-STAT down factor and Factor 2.
# Characterizing Factor Loadings
# First we need to make the gene loadings a pandas DataFrame with named columns and indexes
# Wrap as DataFrame for readability - This could be a function - users will forget to do this
gene_loadings = pd.DataFrame(amodel.varm["gene_loadings"], columns=amodel.uns['gene_loadings_columns'])
gene_loadings.index = amodel.var.index.to_list()
# Make a dictionary of gene expression data
gene_loadings = mina.down.split_by_view(gene_loadings)
progeny_full = dc.op.progeny(organism="human", top=500)
progeny_pos = progeny_full[progeny_full["weight"] > 0]
progeny_neg = progeny_full[progeny_full["weight"] < 0]
progeny_neg["weight"] = progeny_neg["weight"].abs()
# Now you can run decoupler for each matrix of gene loadings
mcp_pws = mina.down.run_ulm_per_view(view_dict=gene_loadings, net=progeny_pos)
mina.pl.plot_mcell_funcomics(mcp_pws,
p_threshold=0.05,
top_n=15,
use_var=True,
figsize=(18, 4))
/var/folders/bh/sfscsy7s6fn3xzz5nl70w34m0000gp/T/ipykernel_48331/640379121.py:13: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Running ULM for view: B cells Running ULM for view: CD14+ Monocytes Running ULM for view: CD4 T cells Running ULM for view: CD8 T cells Running ULM for view: Dendritic cells Running ULM for view: FCGR3A+ Monocytes Running ULM for view: NK cells
/Users/flores/Dropbox/EBI/Research/MINA/src/mina/pl/pl.py:473: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
# Get top 10 genes by weight per source
jstat = progeny_full[progeny_full["source"].isin(["JAK-STAT"])]
jstat = jstat[jstat["weight"] > 0]
# Make it absolute and get top 20
jstat["weight"] = jstat["weight"].abs()
top_genes = (
jstat
.groupby("source", group_keys=False)
.apply(lambda x: x.nlargest(10, "weight"))
.reset_index(drop=True)
)
top_genes_list = top_genes["target"].tolist()
# Keep only genes present in pathway_membership index
mina.pl.plot_features_per_view(df_dict=gene_loadings,
features=top_genes_list,
cmap="viridis")
/var/folders/bh/sfscsy7s6fn3xzz5nl70w34m0000gp/T/ipykernel_48331/4060870048.py:10: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning. /Users/flores/Dropbox/EBI/Research/MINA/src/mina/pl/pl.py:755: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
# Now you can run decoupler for each matrix of gene loadings
mcp_pws_neg = mina.down.run_ulm_per_view(view_dict=gene_loadings, net=progeny_neg)
mina.pl.plot_mcell_funcomics(mcp_pws_neg,
p_threshold=0.05,
top_n=15,
use_var=True,
figsize=(18, 4))
Running ULM for view: B cells Running ULM for view: CD14+ Monocytes Running ULM for view: CD4 T cells Running ULM for view: CD8 T cells Running ULM for view: Dendritic cells Running ULM for view: FCGR3A+ Monocytes Running ULM for view: NK cells
/Users/flores/Dropbox/EBI/Research/MINA/src/mina/pl/pl.py:473: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
# Get top 10 genes by weight per source
jstat = progeny_full[progeny_full["source"].isin(["JAK-STAT"])]
jstat = jstat[jstat["weight"] < 0]
# Make it absolute and get top 20
jstat["weight"] = jstat["weight"].abs()
top_genes = (
jstat
.groupby("source", group_keys=False)
.apply(lambda x: x.nlargest(30, "weight"))
.reset_index(drop=True)
)
top_genes_list = top_genes["target"].tolist()
# Keep only genes present in pathway_membership index
mina.pl.plot_features_per_view(df_dict=gene_loadings,
features=top_genes_list,
cmap="viridis")
/var/folders/bh/sfscsy7s6fn3xzz5nl70w34m0000gp/T/ipykernel_48331/3112716491.py:10: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning. /Users/flores/Dropbox/EBI/Research/MINA/src/mina/pl/pl.py:755: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.