Multicellular Factor Analysis with MOFA

The basics: MOFA-based multicellular factor analysis¶

This vignette introduces the core MINA workflow using an unsupervised multicellular factor model based on MOFA. We use epithelial single-cell data from inflammatory bowel disease to build sample-level multicellular views, learn latent factors across cell types, and interpret these factors as multicellular programs.

In this dataset, the main source of sample-level variability is the enrichment of biopsies for either lamina propria or epithelial compartments. We therefore use this notebook to show how MINA can recover dominant tissue-level structure before moving on to subtler disease-associated programs.

Learning goals

Build multicellular views from single-cell data by aggregating cells into sample-level pseudobulk profiles.
Fit an unsupervised multicellular factor model across cell types.
Interpret the dominant factors with respect to tissue compartment and disease.

This is the first notebook in the MINA vignette series.

Start here if you want the core unsupervised workflow.
Continue to the SOFA vignette if you want to model known sample-level drivers explicitly.
Continue to the groups vignette if you want to model multiple datasets jointly.
Continue to the functional and functional (guided) vignette if you want to use prior knowledge to reconstruct multicellular programs

Unless you are already familiarized with the use of factor models for multicellular integration, we encourage you to follow this order

Some definitions first:¶

Multicellular Factor Analysis (MCFA) is a statistical framework for analyzing single-cell data at the level of whole tissues rather than individual cell types.

Most single-cell approaches analyze cell types independently. MCFA instead models how multiple cell types vary together across samples (e.g. patients), capturing coordinated, tissue-level biology.

Core idea¶

MFA treats each cell type (or tissue feature) as a separate view of the same sample and applies group factor analysis (e.g. via MOFA) to learn a shared latent space.

Input: per-sample summaries for each cell type (e.g. pseudobulk expression)
Model: joint factorization across multiple cell-type views
Output: latent factors interpreted as multicellular programs

These programs represent coordinated gene expression changes across cell types, rather than cell-intrinsic effects.

What it provides¶

Tissue-centric representation
Samples (patients) are embedded in a low-dimensional space capturing multicellular variation
Multicellular programs
Interpretable factors linking coordinated activity across cell types
Unsupervised analysis
No need to predefine contrasts (e.g. case vs control)

Conceptual shift¶

Instead of asking:

What changes within each cell type?

MFA asks:

What coordinated patterns across cell types explain differences between samples?

This reframes single-cell data as a multiview description of tissues, where biological signals emerge from inter-cell-type coordination.

In [1]:

Copied!





import mina
import scanpy as sc
import scanpy.external as sce
import numpy as np
import decoupler as dc
import math
import mofaflex as mf
import mudata as md
import pandas as pd
import matplotlib.pyplot as plt
import mina
import scanpy as sc
import scanpy.external as sce
import numpy as np
import decoupler as dc
import math
import mofaflex as mf
import mudata as md
import pandas as pd
import matplotlib.pyplot as plt

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.

Biological context: epithelial multicellular programs in inflammatory bowel disease¶

To illustrate how Multicellular Factor Analysis can be used to dissect the main sources of variation across whole tissue samples, we apply MINA to a Crohn’s disease dataset. Crohn’s disease is a chronic inflammatory disorder of the gastrointestinal tract, characterized by patchy, transmural inflammation.

We analyse epithelial cells from the scIBD atlas, focusing on ileal biopsies from Kong et al., 2023. For this tutorial, we restrict the analysis to a single library preparation protocol; integration across cohorts and technologies is covered in later tutorials.

In this study, Crohn’s disease patients and controls were profiled using scRNA-seq. For patients, biopsies were obtained from both inflamed and non-inflamed regions of the ileum. Each sample is derived from a specific tissue compartment (e.g. epithelial layer or lamina propria), allowing us to focus on epithelial variation across patients and conditions.

These samples define the unit of analysis, while epithelial cell populations are treated as views summarizing tissue composition.

In [2]:

Copied!

adata = sc.read_h5ad("./data/scIBD/KongEpi_adata_ref.h5ad")
adata = adata[adata.obs["organ"] == "Ileum", :].copy()
sc.pl.umap(adata, color=["disease", 'celltype', "biosample_enriched_fraction", "tissue_sample_status"], frameon=False, ncols=2)
adata = sc.read_h5ad("./data/scIBD/KongEpi_adata_ref.h5ad")
adata = adata[adata.obs["organ"] == "Ileum", :].copy()
sc.pl.umap(adata, color=["disease", 'celltype', "biosample_enriched_fraction", "tissue_sample_status"], frameon=False, ncols=2)

No description has been provided for this image

Build multicellular views from pseudobulked samples¶

We first convert the single-cell object into a multiview representation. Each biosample becomes one observation, and each cell type becomes one view. Within each view, gene expression is summarized as a pseudobulk profile for that sample and cell type.

This representation is the key input for MINA: it preserves cell-type-resolved information while shifting the analysis to the sample level, where tissue-wide multicellular coordination can be modeled.

In [3]:

Copied!





# We extract the metadata based on our column that is the reference
# Filter adata to only include ileum
# Process adata_raw
metadata = mina.up.extract_metadata_from_obs(obs=adata.obs, groupby="biosample_id", sort=False)
pdata = dc.pp.pseudobulk(adata, sample_col="biosample_id", groups_col="celltype")
# Manipulations on pdata
pdata.obs = pdata.obs.set_index("biosample_id", drop=False)
pdata.obs.index.name = None

# Make list of adatas per pseudobulk cell type
anndata_dict = mina.up.split_anndata_by_celltype(pdata=pdata, grouping="celltype")

# Adding gene expression total counts - TODO add to upstream as a function
for _cell_type, adata_raw in anndata_dict.items():
    # Sum across observations (rows) for each gene (column)
    if hasattr(adata_raw.X, "toarray"):
        # Sparse matrix case
        total_counts = adata_raw.X.sum(axis=0).A1  # returns 1D array
    else:
        total_counts = adata_raw.X.sum(axis=0)  # numpy array

    # Add to .var
    adata_raw.var["total_counts"] = total_counts
# We extract the metadata based on our column that is the reference
# Filter adata to only include ileum
# Process adata_raw
metadata = mina.up.extract_metadata_from_obs(obs=adata.obs, groupby="biosample_id", sort=False)
pdata = dc.pp.pseudobulk(adata, sample_col="biosample_id", groups_col="celltype")
# Manipulations on pdata
pdata.obs = pdata.obs.set_index("biosample_id", drop=False)
pdata.obs.index.name = None

# Make list of adatas per pseudobulk cell type
anndata_dict = mina.up.split_anndata_by_celltype(pdata=pdata, grouping="celltype")

# Adding gene expression total counts - TODO add to upstream as a function
for _cell_type, adata_raw in anndata_dict.items():
    # Sum across observations (rows) for each gene (column)
    if hasattr(adata_raw.X, "toarray"):
        # Sparse matrix case
        total_counts = adata_raw.X.sum(axis=0).A1  # returns 1D array
    else:
        total_counts = adata_raw.X.sum(axis=0)  # numpy array

    # Add to .var
    adata_raw.var["total_counts"] = total_counts

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning: 
AnnData expects .obs.index to contain strings, but got values like:
    ['H101694_E', 'H101694_L', 'H102141_E', 'H102141_L', 'H110216_E']

    Inferred to be: categorical

Choose filtering thresholds before fitting the model¶

Before model fitting, it is important to define minimal coverage requirements for each view.

Here we inspect:

the minimum number of cells contributing to each pseudobulk profile
the minimum number of biosamples represented in each cell-type view

These thresholds are dataset dependent, but the main goal is the same: keep views that are sufficiently represented across samples while avoiding very sparse or unstable inputs.

Using decoupler-py we can inspect the number of cells and read counts of each pseudobulk profile to guide our filtering thresholds.

This is an example for a single cell-type (Tuft cells):

In [4]:

Copied!





# An example of a single cell-type
dc.pl.filter_samples(anndata_dict["Tuft cells"], groupby=["disease"], min_cells=20, min_counts=1000, figsize=(4, 4))
plt.gcf().suptitle(f"Tuft cells", fontsize=14, fontweight="bold")
plt.show()
# An example of a single cell-type
dc.pl.filter_samples(anndata_dict["Tuft cells"], groupby=["disease"], min_cells=20, min_counts=1000, figsize=(4, 4))
plt.gcf().suptitle(f"Tuft cells", fontsize=14, fontweight="bold")
plt.show()

We then can apply a filtering to all cell-types

In [5]:

Copied!

# Making filters
mina.up.filt.filter_anndata_by_ncells(anndata_dict, min_cells=20)
# Making filters
mina.up.filt.filter_anndata_by_ncells(anndata_dict, min_cells=20)

Similarly we can use helping plotting functions to judge how many biosamples do we include in each view

In [6]:

Copied!





# From all the possible samples, let's ask for at least 40%
n_samples = len(metadata.index)
min_samples = math.trunc(n_samples * 0.4)
mina.pl.plot_view_samples(anndata_dict, min_samples=min_samples)
# From all the possible samples, let's ask for at least 40%
n_samples = len(metadata.index)
min_samples = math.trunc(n_samples * 0.4)
mina.pl.plot_view_samples(anndata_dict, min_samples=min_samples)

/Users/flores/Dropbox/EBI/Research/MINA/src/mina/pl/pl.py:53: RuntimeWarning: divide by zero encountered in log10

In [7]:

Copied!

# From all the possible samples, let's ask for at least 40%
mina.up.filt.filter_views_by_samples(anndata_dict, min_rows=min_samples)
# From all the possible samples, let's ask for at least 40%
mina.up.filt.filter_views_by_samples(anndata_dict, min_rows=min_samples)

Filter genes within each view¶

We next define which genes are informative enough to retain in each cell-type view. Genes that are rarely detected or have negligible counts across samples tend to add noise without contributing useful structure to the latent model.

The aim of this step is not to aggressively reduce dimensionality, but to remove features that are too sparse to support stable multicellular factor analysis.

In [8]:

Copied!

dc.pl.filter_by_expr(anndata_dict["Goblet cells"], min_count=20, min_total_count=20, large_n=0, min_prop=0.4)
plt.gcf().suptitle(f"Goblet cells", fontsize=14, fontweight="bold")
plt.show()
dc.pl.filter_by_expr(anndata_dict["Goblet cells"], min_count=20, min_total_count=20, large_n=0, min_prop=0.4)
plt.gcf().suptitle(f"Goblet cells", fontsize=14, fontweight="bold")
plt.show()

Final preprocessing before model fitting¶

After sample- and gene-level filtering, we complete the preprocessing pipeline by:

removing views with too few retained genes
filtering samples with low feature coverage (i.e. from 100 genes, a sample covers only a percentage)
storing raw counts
normalizing the data (log-normalization)
selecting highly variable genes - reusing scanpy strategies - default "Seurat"
retaining samples observed across a sufficient number of views

These steps prepare a consistent multiview object for multiview-based modeling.

In [9]:

Copied!





mina.up.filt.filter_genes_byexpr(anndata_dict, min_count=20, min_prop=0.4)
mina.up.filt.filter_views_by_genes(anndata_dict, min_genes_per_view=100)
mina.up.filt.filter_samples_by_coverage(anndata_dict, threshold=0, min_prop=0.90)
mina.up.filt.filter_views_by_samples(anndata_dict, min_rows=min_samples)
mina.up.utils.save_raw_counts(anndata_dict, layer_name="raw_counts")
mina.up.norm_log(anndata_dict, target_sum=1e6, exclude_highly_expressed=False, max_value=None, center=True)
mina.up.filter_hvgs(anndata_dict, groupby = None, ngroups_cut = None)
mina.up.filter_smpls_by_nview(anndata_dict, min_views = 5)
mina.up.filt.filter_genes_byexpr(anndata_dict, min_count=20, min_prop=0.4)
mina.up.filt.filter_views_by_genes(anndata_dict, min_genes_per_view=100)
mina.up.filt.filter_samples_by_coverage(anndata_dict, threshold=0, min_prop=0.90)
mina.up.filt.filter_views_by_samples(anndata_dict, min_rows=min_samples)
mina.up.utils.save_raw_counts(anndata_dict, layer_name="raw_counts")
mina.up.norm_log(anndata_dict, target_sum=1e6, exclude_highly_expressed=False, max_value=None, center=True)
mina.up.filter_hvgs(anndata_dict, groupby = None, ngroups_cut = None)
mina.up.filter_smpls_by_nview(anndata_dict, min_views = 5)

Raw counts saved in the 'raw_counts' layer for each AnnData object.
Normalization, log-transformation, and scaling complete for all AnnData objects with target_sum = 1000000.0.

An important filtering you may want to apply is to filter views by specific genes. This is implemented in up.filter_genes_by_celltype(), where you provide the anndata_dict and a named list with the same views as in your anndata dictionary and each key contains the set of genes to be filtered out. This is mainly useful to avoid including background genes

In [10]:

Copied!

anndata_dict
anndata_dict

Out[10]:

{'Cycling TAs': AnnData object with n_obs × n_vars = 45 × 1267
     obs: 'biosample_id', 'celltype', 'donor_id', 'organ', 'library_preparation_protocol', 'study', 'disease', 'tissue_sample_status', 'celltype_compartment', 'biosample_enriched_fraction', 'psbulk_cells', 'psbulk_counts'
     var: 'features', 'total_counts', 'mean', 'std'
     uns: 'log1p', 'hvg'
     layers: 'psbulk_props', 'raw_counts',
 'Goblet cells': AnnData object with n_obs × n_vars = 46 × 1372
     obs: 'biosample_id', 'celltype', 'donor_id', 'organ', 'library_preparation_protocol', 'study', 'disease', 'tissue_sample_status', 'celltype_compartment', 'biosample_enriched_fraction', 'psbulk_cells', 'psbulk_counts'
     var: 'features', 'total_counts', 'mean', 'std'
     uns: 'log1p', 'hvg'
     layers: 'psbulk_props', 'raw_counts',
 'Ileal enterocytes': AnnData object with n_obs × n_vars = 33 × 1901
     obs: 'biosample_id', 'celltype', 'donor_id', 'organ', 'library_preparation_protocol', 'study', 'disease', 'tissue_sample_status', 'celltype_compartment', 'biosample_enriched_fraction', 'psbulk_cells', 'psbulk_counts'
     var: 'features', 'total_counts', 'mean', 'std'
     uns: 'log1p', 'hvg'
     layers: 'psbulk_props', 'raw_counts',
 'Paneth cells': AnnData object with n_obs × n_vars = 40 × 338
     obs: 'biosample_id', 'celltype', 'donor_id', 'organ', 'library_preparation_protocol', 'study', 'disease', 'tissue_sample_status', 'celltype_compartment', 'biosample_enriched_fraction', 'psbulk_cells', 'psbulk_counts'
     var: 'features', 'total_counts', 'mean', 'std'
     uns: 'log1p', 'hvg'
     layers: 'psbulk_props', 'raw_counts',
 'Stem cells': AnnData object with n_obs × n_vars = 42 × 1165
     obs: 'biosample_id', 'celltype', 'donor_id', 'organ', 'library_preparation_protocol', 'study', 'disease', 'tissue_sample_status', 'celltype_compartment', 'biosample_enriched_fraction', 'psbulk_cells', 'psbulk_counts'
     var: 'features', 'total_counts', 'mean', 'std'
     uns: 'log1p', 'hvg'
     layers: 'psbulk_props', 'raw_counts',
 'Transit amplifying cells (TAs)': AnnData object with n_obs × n_vars = 45 × 1230
     obs: 'biosample_id', 'celltype', 'donor_id', 'organ', 'library_preparation_protocol', 'study', 'disease', 'tissue_sample_status', 'celltype_compartment', 'biosample_enriched_fraction', 'psbulk_cells', 'psbulk_counts'
     var: 'features', 'total_counts', 'mean', 'std'
     uns: 'log1p', 'hvg'
     layers: 'psbulk_props', 'raw_counts'}

Fit the multicellular factor model¶

We now convert the processed views into a MuData object and fit an unsupervised MOFA model through MOFAFLEX. The resulting latent factors summarize coordinated expression changes across cell types and samples.

For this introductory vignette, we use a compact model with five factors to obtain an interpretable overview of the main multicellular axes of variation.

NOTE To avoid issues during model fit, we have provided a helping function append_view_to_var that adds a prefix to each feature and makes it suitable for all downstream functions in MINA.

In [11]:

Copied!





# This is necessary before any model fit!
mina.up.utils.append_view_to_var(anndata_dict)

mdata = md.MuData(anndata_dict)
model = mf.MOFAFLEX(
    mdata,
    mf.ModelOptions(n_factors=5, weight_prior="SnS", likelihoods="Normal"),
    mf.TrainingOptions(seed=42, save_path=False, lr=0.01, early_stopper_patience=500)
)
# This is necessary before any model fit!
mina.up.utils.append_view_to_var(anndata_dict)

mdata = md.MuData(anndata_dict)
model = mf.MOFAFLEX(
    mdata,
    mf.ModelOptions(n_factors=5, weight_prior="SnS", likelihoods="Normal"),
    mf.TrainingOptions(seed=42, save_path=False, lr=0.01, early_stopper_patience=500)
)

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1403: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1275: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
2026-03-27 11:52:04 | [WARNING] Device cuda is not available. Using default device: cpu

2026-03-27 11:52:04 | [WARNING] Could not import dask. Data arrays may be copied, resulting in high memory usage.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1403: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1275: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning: 
AnnData expects .obs.index to contain strings, but got values like:
    ['H101694_L', 'H102141_E', 'H102141_L', 'H110216_L', 'H152638_L']

    Inferred to be: categorical

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning: 
AnnData expects .obs.index to contain strings, but got values like:
    ['H101694_L', 'H102141_E', 'H102141_L', 'H110216_L', 'H152638_L']

    Inferred to be: categorical

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning: 
AnnData expects .obs.index to contain strings, but got values like:
    ['H101694_L', 'H102141_E', 'H102141_L', 'H152638_L', 'H199129_L']

    Inferred to be: categorical

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning: 
AnnData expects .obs.index to contain strings, but got values like:
    ['H101694_L', 'H102141_E', 'H110216_L', 'H152638_L', 'H199129_L']

    Inferred to be: categorical

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning: 
AnnData expects .obs.index to contain strings, but got values like:
    ['H101694_L', 'H102141_E', 'H102141_L', 'H110216_L', 'H152638_L']

    Inferred to be: categorical

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning: 
AnnData expects .obs.index to contain strings, but got values like:
    ['H101694_L', 'H102141_E', 'H102141_L', 'H110216_L', 'H152638_L']

    Inferred to be: categorical

2026-03-27 11:52:04 | [INFO] Initializing factors using `random` method...
  0%|          | 0/10000 [00:00<?, ?epochs/s]2026-03-27 11:52:04 | [INFO] Guessed max_plate_nesting = 3
 22%|██▏       | 2173/10000 [01:17<04:11, 31.13epochs/s, Loss=1.03e+4]2026-03-27 11:53:22 | [INFO] Training converged after 2174 epochs.
 22%|██▏       | 2174/10000 [01:17<04:40, 27.90epochs/s, Loss=1.03e+4]

Outputs of a factor model¶

A factor model decomposes data into latent factors (capturing sample-level variation) and loadings (capturing feature-level contributions), summarizing the main sources of variability.

1. Factor scores (sample representation)¶

Matrix of size (samples × factors)
Each value indicates how strongly a sample activates a given factor
Interpreted relatively: samples with similar scores share similar underlying structure

Factor scores define a low-dimensional embedding of samples, capturing major axes of variation.

2. Explained variance (importance of factors)¶

Quantifies how much variability is captured by each factor
Can be computed globally or per view / cell-type

Used to:

rank factors by importance
distinguish dominant vs minor sources of variation

3. Factor loadings (feature weights)¶

Matrix of size (features × factors)
Each value measures how strongly a feature (e.g. gene) contributes to a factor

Interpretation:

values ≈ 0 → weak or no contribution
large absolute values → strong contribution
sign → direction of association

Loadings provide the biological interpretation of factors, linking latent variation to features.

Convert the fitted model into an interoperable `AnnData`¶

To simplify downstream analysis, we convert the fitted model into an AnnData object that integrates latent factors, metadata, and feature-level information in a format compatible with the scverse ecosystem.

In the resulting object:

.X stores factor scores for each sample
.obs stores sample-level metadata
.var stores factor-level information
.varm stores gene loadings
.obsm stores additional view-level outputs generated during model conversion

In [12]:

Copied!





amodel = mina.down.model_to_anndata(
    anndata_dict=anndata_dict,
    metadata=metadata,
    model=model,
)
amodel = mina.down.model_to_anndata(
    anndata_dict=anndata_dict,
    metadata=metadata,
    model=model,
)

Identify the dominant source of variability¶

One of first questions we can ask to our model is related to the known covariates associated with our factors. This helps you describe the variability of your biosamples in simpler terms.

For this we have implemented in mina a simple wrapper of statistical analyses that takes an anndata as input and allows you to perform continuous (linear models) or categorical associations (ANOVAs) between the factor scores and any variable in .obs. We provide also the chance to perform mixed-effect models where you can define a random effect.

We first test whether the learned factors are associated with biosample_enriched_fraction, which distinguishes lamina propria-enriched from epithelial-enriched biopsies.

This is the main signal we expect to recover in this dataset. Establishing that point is important, because it shows that the leading multicellular factors are capturing major tissue-compartment structure rather than disease alone.

In [13]:

Copied!





fraction_stats = mina.down.get_associations(amodel, 
                           test_variable = "biosample_enriched_fraction",
                           test_type="categorical",
                           random_effect = None)

print(fraction_stats)
fraction_stats = mina.down.get_associations(amodel, 
                           test_variable = "biosample_enriched_fraction",
                           test_type="categorical",
                           random_effect = None)

print(fraction_stats)

   feature  statistic   p_value  adj_p_value
0  Factor1   1.345792  0.252271     0.315339
1  Factor2   5.520700  0.023345     0.038908
2  Factor3  12.514044  0.000966     0.002415
3  Factor4  21.218348  0.000035     0.000175
4  Factor5   0.814921  0.371585     0.371585

In this case we have three factors that associate with differences in the enriched fraction of the sample.

Standard Scanpy visualizations can then be used to inspect the factor space directly. In this case, we use them to confirm that the major separation between samples is driven by biosample_enriched_fraction. For factors with significant associations

In [14]:

Copied!

sc.pl.violin(amodel, 
             ["Factor2","Factor3","Factor4"], 
             groupby="biosample_enriched_fraction", rotation=90,size = 3)
sc.pl.violin(amodel, 
             ["Factor2","Factor3","Factor4"], 
             groupby="biosample_enriched_fraction", rotation=90,size = 3)

Similarly, we can easily visualize the factor space

In [15]:

Copied!





sc.pl.scatter(amodel,
              x = "Factor3",
              y = "Factor4", 
              color  = "biosample_enriched_fraction", 
              size =100)
sc.pl.scatter(amodel,
              x = "Factor3",
              y = "Factor4", 
              color  = "biosample_enriched_fraction", 
              size =100)

We can also inspect the variance explained by each latent factor across views using MOFAFLEX's functions. This provides a complementary summary of how strongly each factor contributes to the multiview structure learned by the model.

In [16]:

Copied!

mf.pl.variance_explained(model)
mf.pl.variance_explained(model)

Out[16]:

In this case, we can do claims such as: "Factor 3 recovers in a larger extent the variability of Cycling TAs"

To move from factor-level associations to view-level interpretation, we can estimate the total variance associated with a variable of interest. This helps identify which cell types contribute most strongly to the observed compartment-driven variation.

Basically: Which cell type seems to capture more the differences between enriched compartments?

In [17]:

Copied!

mina.down.calc_total_variance(amodel, fraction_stats, pval_thrs=0.05) # Using the statistics from above
mina.down.calc_total_variance(amodel, fraction_stats, pval_thrs=0.05) # Using the statistics from above

Out[17]:

{'Cycling TAs': feature
 group_1    0.384087
 dtype: float32,
 'Goblet cells': feature
 group_1    0.275133
 dtype: float32,
 'Ileal enterocytes': feature
 group_1    0.171948
 dtype: float32,
 'Paneth cells': feature
 group_1    0.242652
 dtype: float32,
 'Stem cells': feature
 group_1    0.338876
 dtype: float32,
 'Transit amplifying cells (TAs)': feature
 group_1    0.344069
 dtype: float32}

In this case we observed cycling TAs, TAs, and stem cells as the cell types whose variation associates the most with enriched fraction

mina also provides functions to summarize associations across multiple metadata variables at once. This is useful for contrasting dominant and secondary sources of variation in the same fitted model.

In [ ]:

Copied!





# Get the p-value matrix for the associations of interest (adjusted)
all_assocs = mina.down.get_pval_matrix(amodel,["tissue_sample_status", "disease", "biosample_enriched_fraction"])
mina.pl.pl.plot_pval_tiles(all_assocs, title="MOFA 5 factors - P-value matrix",star_threshold=0.05)
plt.show()
# Get the p-value matrix for the associations of interest (adjusted)
all_assocs = mina.down.get_pval_matrix(amodel,["tissue_sample_status", "disease", "biosample_enriched_fraction"])
mina.pl.pl.plot_pval_tiles(all_assocs, title="MOFA 5 factors - P-value matrix",star_threshold=0.05)
plt.show()

Characterise a secondary disease-associated factor¶

Once the dominant compartment effect has been established, we can inspect additional factors associated with disease. In this example, Factor5 captures a secondary axis of variation linked to inflammatory state.

We therefore use Factor5 to illustrate how factor loadings can be interpreted biologically across cell types.

In [19]:

Copied!

sc.pl.violin(amodel, 
             ["Factor5"], 
             groupby="disease", rotation=90,size = 3)
sc.pl.violin(amodel, 
             ["Factor5"], 
             groupby="disease", rotation=90,size = 3)

Here, negative values of Factor5 correspond to stronger activation in Crohn's disease samples, whereas positive values correspond to higher activity in healthy samples or relative downregulation in disease.

To interpret this factor, we extract gene loadings across views and summarize them with pathway-level enrichment analysis. This allows us to translate a latent multicellular factor into coordinated biological programs across epithelial cell types.

In [20]:

Copied!





# Characterizing Factor Loadings
# First we need to make the gene loadings a pandas DataFrame with named columns and indexes

# Wrap as DataFrame for readability - This could be a function - users will forget to do this
gene_loadings = pd.DataFrame(amodel.varm["gene_loadings"], columns=amodel.uns['gene_loadings_columns'])
gene_loadings.index = amodel.var.index.to_list()

# Make a dictionary of gene expression data
gene_loadings = mina.down.split_by_view(gene_loadings)
# Characterizing Factor Loadings
# First we need to make the gene loadings a pandas DataFrame with named columns and indexes

# Wrap as DataFrame for readability - This could be a function - users will forget to do this
gene_loadings = pd.DataFrame(amodel.varm["gene_loadings"], columns=amodel.uns['gene_loadings_columns'])
gene_loadings.index = amodel.var.index.to_list()

# Make a dictionary of gene expression data
gene_loadings = mina.down.split_by_view(gene_loadings)

In [26]:

Copied!

gene_loadings["Cycling TAs"]
gene_loadings["Cycling TAs"]

Out[26]:

feature	7SK-ENSG00000260682	AAAS	AADAC	AAMP	ABHD14B	ABHD17A	ABHD17C	ACAA1	ACAP2	ACAT1	...	ZNF511	ZNF524	ZNF622	ZNF638	ZNF644	ZNF787	ZNHIT1	ZNHIT6	ZWILCH	ZWINT
Factor1	0.001392	0.475870	-0.002397	0.027284	0.162002	0.372084	-0.235553	-0.003974	0.347659	0.249128	...	-0.799564	-0.530414	-0.052572	0.522516	0.305381	0.002603	0.535748	0.000058	0.484871	-0.007446
Factor2	0.490200	0.026656	-0.030896	0.582488	0.003021	0.162017	0.000303	0.134971	0.033339	0.711946	...	-0.253944	0.540531	0.705261	0.440159	0.024282	0.620651	-0.514821	0.468154	-0.000752	-0.019539
Factor3	-0.538292	-0.452985	0.425257	-0.069367	0.672377	-0.646826	-0.554977	-0.483363	-0.125783	-0.059511	...	-0.123973	0.331857	-0.803141	-0.499673	-0.335234	-0.198446	0.000428	-0.996697	-1.258811	-0.331185
Factor4	-0.098823	-0.469970	-0.148880	-0.191082	-0.313662	0.430406	-1.102156	0.808390	-0.146083	-0.589405	...	0.796380	0.650362	0.441467	0.000777	-0.927431	0.341908	0.262192	-0.393100	-0.737649	0.715396
Factor5	-0.371543	-1.008295	-0.036869	0.770032	-0.077031	0.673065	0.734859	0.463891	0.545790	-0.703389	...	0.183183	-0.829557	0.626042	0.938458	0.736765	0.730909	0.504760	0.483019	-0.321423	0.100240

5 rows × 1267 columns

Note that the converted gene loadings now look like a expression matrix that we can use to perform regular enrichment analyses as presented in decoupler.

In this example we will use PROGENy which collect perturbation-response genes of 14 distinct signaling pathways. By enriching these signatures into gene loadings, we are able to infer signaling pathway activities.

In MINA we provide a wrapper that allows you to perform this across all cell-types

In [27]:

Copied!

# Now you can run decoupler for each matrix of gene loadings
progeny = dc.op.progeny(organism="human")
mcp_pws = mina.down.run_ulm_per_view(view_dict=gene_loadings, net=progeny)
# Now you can run decoupler for each matrix of gene loadings
progeny = dc.op.progeny(organism="human")
mcp_pws = mina.down.run_ulm_per_view(view_dict=gene_loadings, net=progeny)

Running ULM for view: Cycling TAs
Running ULM for view: Goblet cells
Running ULM for view: Ileal enterocytes
Running ULM for view: Paneth cells
Running ULM for view: Stem cells
Running ULM for view: Transit amplifying cells (TAs)

We also provide some helper plotting functions. They take as input your enrichment analyses and show you the top significant enriched pathways per view.

In [22]:

Copied!





mina.pl.plot_mcell_funcomics(mcp_pws, 
                     p_threshold=0.05,
                     top_n=15,
                     use_var=True,
                     figsize=(18, 4))
mina.pl.plot_mcell_funcomics(mcp_pws, 
                     p_threshold=0.05,
                     top_n=15,
                     use_var=True,
                     figsize=(18, 4))

/Users/flores/Dropbox/EBI/Research/MINA/src/mina/pl/pl.py:473: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.

In this example, the disease-associated Factor 5 highlights inflammatory pathway activity in ileal enterocyte populations in Crohn's disease patients.

Note that we can claim this, because the activity is blue (negative) and we know that Crohn's disease patients have lower factor 5 scores.

This illustrates how MINA can move from latent factors to biologically interpretable multicellular programs.

Infer multicellular information networks¶

Multicellular programs summarize coordinated variation across cell types, but they do not directly describe pairwise dependencies between them. To explore this structure further, MINA can derive multicellular information networks from a selected factor. The simple interpretation of this is that now we will use our multicellular program to ask how much of the information of a cell type, allows you to explain the information of the rest.

Here we use the disease-associated factor to estimate directed relationships between cell-type signatures across samples. The resulting network provides a complementary view of how multicellular coordination may be organized within the tissue.

TECHNICAL DESCRIPTION

Here we build a cell-cell dependency network with directed edges that describe to what extent the molecular profile of one cell-type could predict the profile of each other cell-type. First, we generated cell-type signature matrices per patient sample and study, by enriching the cell-type-specific gene weights of a given factor into the pseudobulk profiles of their respective cell-type. Enrichment was performed with linear models as implemented in decoupleR. To estimate the weight of the incoming edges of the multicellular coordination networks, we built linear mixed models to predict the signatures of a specific cell type with the signatures of the rest. In these models, a covariate of interest can be treated as a random effect. For a given predicted cell-type, the coefficient estimate of each predictor cell-type multiplied by the overall fit of the model was used as the final edge weight. This is done for positive and negative loadings independently. The details are provided in Lanzer, Ramirez, et. al. 2025.

In [23]:

Copied!





MINs = mina.down.get_multicell_net(
    test_model = amodel,
    sel_factor = "Factor5",
    random_effect = None,
    standardize = True,
    drop_na = True,
    verbose = True,
)
MINs = mina.down.get_multicell_net(
    test_model = amodel,
    sel_factor = "Factor5",
    random_effect = None,
    standardize = True,
    drop_na = True,
    verbose = True,
)

Here in addition, we focus on the negative set of gene loadings, given that we know that Crohn's disease patients have lower activities of Factor 5

In [24]:

Copied!





graph_viz = mina.pl.plot_mcell_network(
    df = MINs["negative"],
    weight_col="cor_estimate",
    abs_cutoff=0.2,
    keep_negative=False,
    title="Coordination network - Crohn's disease",
    show_edge_labels = False,
    save_path=None,
    reciprocal_curvature=0.3,
    default_curvature=0.4)
graph_viz = mina.pl.plot_mcell_network(
    df = MINs["negative"],
    weight_col="cor_estimate",
    abs_cutoff=0.2,
    keep_negative=False,
    title="Coordination network - Crohn's disease",
    show_edge_labels = False,
    save_path=None,
    reciprocal_curvature=0.3,
    default_curvature=0.4)

Summary¶

In this vignette, we used unsupervised multicellular factor analysis to model coordinated sample-level variation across epithelial cell types in inflammatory bowel disease.

The main conclusion is that the strongest latent structure is associated with biosample_enriched_fraction, separating lamina propria-enriched and epithelial-enriched biopsies. After accounting for that dominant axis, additional factors reveal disease-associated multicellular programs that can be interpreted through gene loadings, pathway activity, and cell-type coordination networks.

Use this workflow when you want an unsupervised overview of multicellular structure within a single dataset. If a known sample-level variable dominates the latent space and should be modeled explicitly, the next vignette shows how to do this with SOFA.