Multicellular Programs Across Experimental Groups

Multicellular Factor Analysis across datasets with MOFA+ grouping strategy¶

NOTE
This is an advanced user notebook. Familiarity with MOFA-based multicellular factor analysis is assumed.

In many applications, data are collected across multiple studies, cohorts, or experimental batches. These datasets often share underlying biological structure, while also containing dataset-specific effects arising from technical or contextual differences.

While one could use projections to compare distinct batches, a better strategy is to model this explicitly. MOFA+ extends factor analysis to jointly analyze multiple groups of samples, enabling the decomposition of variability into:

Shared factors: capturing variation common across datasets
Group-specific factors: capturing variation unique to each dataset

This grouping strategy allows the model to distinguish reproducible biological signals from dataset-specific effects, without requiring prior data integration or batch correction.

Why use joint modeling?¶

Joint modeling is useful when:

multiple datasets profile similar biological systems
there is interest in identifying reproducible multicellular programs
datasets contain batch or study-specific effects that should be separated from shared biology

In contrast to simple concatenation or integration, MOFA+ provides a structured way to partition variability across datasets, improving interpretability.

Application to Crohn’s disease data¶

To illustrate this, we consider two batches from a single inflammatory bowel disease (IBD) study. Although these batches profile comparable biological samples, they may differ due to technical or experimental factors.

Using MOFA+ grouping, we:

treat each batch as a separate group
jointly model their variation within a shared latent space

This allows us to:

identify shared multicellular programs consistent across batches
isolate batch-specific sources of variation

As a result, we can assess the robustness of inferred programs and distinguish biological signals from technical variability.

In [1]:

Copied!





import mina
import scanpy as sc
import scanpy.external as sce
import numpy as np
import decoupler as dc
import math
import mofaflex as mf
import mudata as md
import pandas as pd
import matplotlib.pyplot as plt
import mina
import scanpy as sc
import scanpy.external as sce
import numpy as np
import decoupler as dc
import math
import mofaflex as mf
import mudata as md
import pandas as pd
import matplotlib.pyplot as plt

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.

To illustrate the use of guided models, we apply MINA to a Crohn’s disease dataset. Crohn’s disease is a chronic inflammatory disorder of the gastrointestinal tract, characterized by patchy, transmural inflammation.

Here we re-analyse epithelial cells from the scIBD atlas, focusing on ileal biopsies from Kong et al., 2023. Here we treat the two library preparation protocols as two distinct datasets to integrate.

10x 3' v3¶

In [2]:

Copied!





# Doing pre-processing of dataset number 1
adata = sc.read_h5ad("./data/scIBD/KongEpi_adata_ref.h5ad")
sc.pl.umap(adata, color=['celltype'], frameon=False, ncols=2)
adata = adata[adata.obs["organ"] == "Ileum", :].copy()
# Process adata_raw
metadata = mina.up.extract_metadata_from_obs(obs=adata.obs, groupby="biosample_id", sort=False)
metadata["study_group"] = "KongChemA"
pdata = dc.pp.pseudobulk(adata, sample_col="biosample_id", groups_col="celltype")
# Manipulations on pdata
pdata.obs = pdata.obs.set_index("biosample_id", drop=False)
pdata.obs.index.name = None

# Make list of adatas per pseudobulk cell type
anndata_dict = mina.up.split_anndata_by_celltype(pdata=pdata, grouping="celltype")

# Adding gene expression total counts - TODO add to upstream as a function
for _cell_type, adata_raw in anndata_dict.items():
    # Sum across observations (rows) for each gene (column)
    if hasattr(adata_raw.X, "toarray"):
        # Sparse matrix case
        total_counts = adata_raw.X.sum(axis=0).A1  # returns 1D array
    else:
        total_counts = adata_raw.X.sum(axis=0)  # numpy array

    # Add to .var
    adata_raw.var["total_counts"] = total_counts

# From all the possible samples, let's ask for at least 40%
n_samples = len(metadata.index)
min_samples = math.trunc(n_samples * 0.4)

# Processing
mina.up.filt.filter_anndata_by_ncells(anndata_dict, min_cells=20)
mina.up.filt.filter_genes_byexpr(anndata_dict, min_count=20, min_prop=0.4)
mina.up.filt.filter_views_by_genes(anndata_dict, min_genes_per_view=100)
mina.up.filt.filter_samples_by_coverage(anndata_dict, threshold=0, min_prop=0.90)
mina.up.filt.filter_views_by_samples(anndata_dict, min_rows=min_samples)
mina.up.utils.save_raw_counts(anndata_dict, layer_name="raw_counts")
mina.up.norm_log(anndata_dict, target_sum=1e6, exclude_highly_expressed=False, max_value=None, center=True)
#mina.up.filter_hvgs(anndata_dict, groupby = None, ngroups_cut = None)
mina.up.filter_smpls_by_nview(anndata_dict, min_views = 5)
# Doing pre-processing of dataset number 1
adata = sc.read_h5ad("./data/scIBD/KongEpi_adata_ref.h5ad")
sc.pl.umap(adata, color=['celltype'], frameon=False, ncols=2)
adata = adata[adata.obs["organ"] == "Ileum", :].copy()
# Process adata_raw
metadata = mina.up.extract_metadata_from_obs(obs=adata.obs, groupby="biosample_id", sort=False)
metadata["study_group"] = "KongChemA"
pdata = dc.pp.pseudobulk(adata, sample_col="biosample_id", groups_col="celltype")
# Manipulations on pdata
pdata.obs = pdata.obs.set_index("biosample_id", drop=False)
pdata.obs.index.name = None

# Make list of adatas per pseudobulk cell type
anndata_dict = mina.up.split_anndata_by_celltype(pdata=pdata, grouping="celltype")

# Adding gene expression total counts - TODO add to upstream as a function
for _cell_type, adata_raw in anndata_dict.items():
    # Sum across observations (rows) for each gene (column)
    if hasattr(adata_raw.X, "toarray"):
        # Sparse matrix case
        total_counts = adata_raw.X.sum(axis=0).A1  # returns 1D array
    else:
        total_counts = adata_raw.X.sum(axis=0)  # numpy array

    # Add to .var
    adata_raw.var["total_counts"] = total_counts

# From all the possible samples, let's ask for at least 40%
n_samples = len(metadata.index)
min_samples = math.trunc(n_samples * 0.4)

# Processing
mina.up.filt.filter_anndata_by_ncells(anndata_dict, min_cells=20)
mina.up.filt.filter_genes_byexpr(anndata_dict, min_count=20, min_prop=0.4)
mina.up.filt.filter_views_by_genes(anndata_dict, min_genes_per_view=100)
mina.up.filt.filter_samples_by_coverage(anndata_dict, threshold=0, min_prop=0.90)
mina.up.filt.filter_views_by_samples(anndata_dict, min_rows=min_samples)
mina.up.utils.save_raw_counts(anndata_dict, layer_name="raw_counts")
mina.up.norm_log(anndata_dict, target_sum=1e6, exclude_highly_expressed=False, max_value=None, center=True)
#mina.up.filter_hvgs(anndata_dict, groupby = None, ngroups_cut = None)
mina.up.filter_smpls_by_nview(anndata_dict, min_views = 5)

No description has been provided for this image

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning: 
AnnData expects .obs.index to contain strings, but got values like:
    ['H101694_E', 'H101694_L', 'H102141_E', 'H102141_L', 'H110216_E']

    Inferred to be: categorical

Raw counts saved in the 'raw_counts' layer for each AnnData object.
Normalization, log-transformation, and scaling complete for all AnnData objects with target_sum = 1000000.0.

/Users/flores/Dropbox/EBI/Research/MINA/src/mina/up/filt.py:126: RuntimeWarning: invalid value encountered in divide
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:3824: RuntimeWarning: Mean of empty slice
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/numpy/_core/_methods.py:142: RuntimeWarning: invalid value encountered in scalar divide

10x 3' v2¶

In [3]:

Copied!





# Doing pre-processing of dataset number 2
adata_proj = sc.read_h5ad("./data/scIBD/KongEpi_adata_proj.h5ad")
sc.pl.umap(adata_proj, color=['celltype'], frameon=False, ncols=2)
# Process adata_raw
metadata_proj = mina.up.extract_metadata_from_obs(obs=adata_proj.obs, groupby="biosample_id", sort=False)
metadata_proj["study_group"] = "KongChemB"
pdata_proj = dc.pp.pseudobulk(adata_proj, sample_col="biosample_id", groups_col="celltype")
# Manipulations on pdata
pdata_proj.obs = pdata_proj.obs.set_index("biosample_id", drop=False)
pdata_proj.obs.index.name = None
anndata_dict_proj = mina.up.split_anndata_by_celltype(pdata=pdata_proj, grouping="celltype")

# Adding gene expression total counts - TODO add to upstream as a function
for _cell_type, adata_raw in anndata_dict_proj.items():
    # Sum across observations (rows) for each gene (column)
    if hasattr(adata_raw.X, "toarray"):
        # Sparse matrix case
        total_counts = adata_raw.X.sum(axis=0).A1  # returns 1D array
    else:
        total_counts = adata_raw.X.sum(axis=0)  # numpy array

    # Add to .var
    adata_raw.var["total_counts"] = total_counts

mina.up.filter_anndata_by_ncells(anndata_dict_proj, min_cells=10)
# From all the possible samples, let's ask for at least 40%
n_samples = len(metadata_proj.index)
min_samples = math.trunc(n_samples * 0.4)
mina.up.filter_views_by_samples(anndata_dict_proj, min_rows=min_samples)
mina.up.filter_genes_byexpr(anndata_dict_proj, min_count=5, min_prop=0.4)
mina.up.filter_views_by_genes(anndata_dict_proj, min_genes_per_view=20)
mina.up.filter_samples_by_coverage(anndata_dict_proj, threshold=0, min_prop=0.90)
mina.up.filter_views_by_samples(anndata_dict_proj, min_rows=min_samples)
mina.up.utils.save_raw_counts(anndata_dict_proj, layer_name="raw_counts")
mina.up.norm_log(anndata_dict_proj, target_sum=1e6, exclude_highly_expressed=False, max_value=None, center=True)
#mina.up.filter_hvgs(anndata_dict, groupby = None, ngroups_cut = None)
#mina.up.append_view_to_var(anndata_dict_proj)
# Doing pre-processing of dataset number 2
adata_proj = sc.read_h5ad("./data/scIBD/KongEpi_adata_proj.h5ad")
sc.pl.umap(adata_proj, color=['celltype'], frameon=False, ncols=2)
# Process adata_raw
metadata_proj = mina.up.extract_metadata_from_obs(obs=adata_proj.obs, groupby="biosample_id", sort=False)
metadata_proj["study_group"] = "KongChemB"
pdata_proj = dc.pp.pseudobulk(adata_proj, sample_col="biosample_id", groups_col="celltype")
# Manipulations on pdata
pdata_proj.obs = pdata_proj.obs.set_index("biosample_id", drop=False)
pdata_proj.obs.index.name = None
anndata_dict_proj = mina.up.split_anndata_by_celltype(pdata=pdata_proj, grouping="celltype")

# Adding gene expression total counts - TODO add to upstream as a function
for _cell_type, adata_raw in anndata_dict_proj.items():
    # Sum across observations (rows) for each gene (column)
    if hasattr(adata_raw.X, "toarray"):
        # Sparse matrix case
        total_counts = adata_raw.X.sum(axis=0).A1  # returns 1D array
    else:
        total_counts = adata_raw.X.sum(axis=0)  # numpy array

    # Add to .var
    adata_raw.var["total_counts"] = total_counts

mina.up.filter_anndata_by_ncells(anndata_dict_proj, min_cells=10)
# From all the possible samples, let's ask for at least 40%
n_samples = len(metadata_proj.index)
min_samples = math.trunc(n_samples * 0.4)
mina.up.filter_views_by_samples(anndata_dict_proj, min_rows=min_samples)
mina.up.filter_genes_byexpr(anndata_dict_proj, min_count=5, min_prop=0.4)
mina.up.filter_views_by_genes(anndata_dict_proj, min_genes_per_view=20)
mina.up.filter_samples_by_coverage(anndata_dict_proj, threshold=0, min_prop=0.90)
mina.up.filter_views_by_samples(anndata_dict_proj, min_rows=min_samples)
mina.up.utils.save_raw_counts(anndata_dict_proj, layer_name="raw_counts")
mina.up.norm_log(anndata_dict_proj, target_sum=1e6, exclude_highly_expressed=False, max_value=None, center=True)
#mina.up.filter_hvgs(anndata_dict, groupby = None, ngroups_cut = None)
#mina.up.append_view_to_var(anndata_dict_proj)

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:795: UserWarning: 
AnnData expects .obs.index to contain strings, but got values like:
    ['H106265_N', 'H139073_N', 'H158108_N1', 'H158108_N2', 'H158108_N3']

    Inferred to be: categorical

Raw counts saved in the 'raw_counts' layer for each AnnData object.
Normalization, log-transformation, and scaling complete for all AnnData objects with target_sum = 1000000.0.

Merging multiple studies across views¶

This function merges multiple studies into a unified set of AnnData objects, where each study is represented as a dictionary of views (e.g. cell types, modalities, or compartments).

In MINA we provide a wrapper that allows to align distinct datasets in a single multi-view structure. We assume that datasets are somehow homogenized in annotations, since it will only keep similar things.

In other words,ensure both your views and metadata contain similar information.

The merging can look complicated but merge operates at two levels:

View selection (which views to retain across studies)
Feature selection (which genes to retain within each merged view)

1. View selection (view_mode)¶

Determines which views are retained before merging:

Union
Keep all views present in at least one study
Intersection
Keep only views shared across all studies
Minimum number (min_n)
Keep views present in at least a specified number of studies (min_view_studies)

This controls the coverage vs comparability trade-off at the level of views.

2. Feature selection (var_mode)¶

After selecting a view, the corresponding AnnData objects are concatenated across studies.
var_mode determines which genes/features are retained:

Outer
Keep all genes present in any study (union)
Inner
Keep only genes shared across all studies (intersection)
Minimum number (min_n)
Keep genes present in at least a specified number of studies (min_var_studies)

This controls the coverage vs consistency trade-off at the feature level.

3. Merging procedure¶

For each retained view:

collect AnnData objects from studies containing that view
annotate cells with their study of origin
concatenate cells across studies
retain only shared metadata columns
subset genes according to the selected feature mode

The result is one AnnData object per view with:

rows: cells from all contributing studies
columns: selected genes

4. Practical configurations¶

Conservative merge
Keep only shared views and shared genes across all studies
→ maximizes comparability, reduces coverage
Permissive merge
Keep all views and all genes across studies
→ maximizes coverage, may introduce sparsity and heterogeneity
Balanced approach
Keep views and genes observed in at least a minimum number of studies
→ balances robustness and coverage, often suitable for multi-study data

In [4]:

Copied!





# Merging datasets
anndata_merged = mina.up.merge_adata_views(
    studies = [anndata_dict, anndata_dict_proj],
    study_names = ["ChemA", "ChemB"],
    view_mode = "intersection",
    min_view_studies = 2,
    var_mode = "inner",
    min_var_studies = 2
)

mina.up.utils.append_view_to_var(anndata_merged)
# Merging datasets
anndata_merged = mina.up.merge_adata_views(
    studies = [anndata_dict, anndata_dict_proj],
    study_names = ["ChemA", "ChemB"],
    view_mode = "intersection",
    min_view_studies = 2,
    var_mode = "inner",
    min_var_studies = 2
)

mina.up.utils.append_view_to_var(anndata_merged)

In addition, all the metadata needs to be combined to use downstream analyses

In [5]:

Copied!

metadata_merged = pd.concat([metadata, metadata_proj], join="inner", axis=0)
metadata_merged = pd.concat([metadata, metadata_proj], join="inner", axis=0)

Finally, and VERY IMPORTANTLY add your group information in the .obs of the mudata we will use for modeling

In [6]:

Copied!

model_data = md.MuData(anndata_merged)
model_data.obs["study_group"] = model_data.obs.index.map(metadata_merged["study_group"])
model_data = md.MuData(anndata_merged)
model_data.obs["study_group"] = model_data.obs.index.map(metadata_merged["study_group"])

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1403: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1275: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.

Fitting the model with MOFA-FLEX¶

Here, we use MOFA-FLEX to incorporate a group-level covariate (study_group) and build a group model.

As a result, the model is configured with:

5 unguided factors capturing the variation of the two studies and all cell-types simultaneously

Guided modeling requires specifying one parameter in DataOptions, which is the "study_group"

In [7]:

Copied!





# Fitting model
model_group = mf.terms.MofaFlex(n_factors=5, weight_prior="SpikeSlab", init_factors="random")
model_group.fit(model_data, seed=42, save_path=False, 
                lr=0.01, early_stopper_patience=1000, group_by="study_group", 
                likelihoods="Normal")
# Fitting model
model_group = mf.terms.MofaFlex(n_factors=5, weight_prior="SpikeSlab", init_factors="random")
model_group.fit(model_data, seed=42, save_path=False, 
                lr=0.01, early_stopper_patience=1000, group_by="study_group", 
                likelihoods="Normal")

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1403: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/mudata/_core/mudata.py:1275: FutureWarning: From 0.4 .update() will not pull obs/var columns from individual modalities by default anymore. Set mudata.set_options(pull_on_update=False) to adopt the new behaviour, which will become the default. Use new pull_obs/pull_var and push_obs/push_var methods for more flexibility.
2026-05-26 15:48:49 | [WARNING] Device cuda is not available. Using default device: cpu

2026-05-26 15:48:50 | [WARNING] Could not import dask. Data arrays may be copied, resulting in high memory usage.
2026-05-26 15:48:50 | [INFO] Initializing factors using 'random' method...
  0%|          | 0/10000 [00:00<?, ?epoch/s]2026-05-26 15:48:50 | [INFO] Guessed max_plate_nesting = 2
 29%|██▊       | 2851/10000 [04:03<10:07, 11.76epoch/s, Loss=3.06e+4]2026-05-26 15:52:53 | [INFO] Training converged after 2852 epochs.
 29%|██▊       | 2852/10000 [04:03<10:10, 11.72epoch/s, Loss=3.06e+4]

Outputs of joint multicellular factor analysis (MOFA+)¶

Joint modeling with MOFA+ produces the same core outputs as standard factor models, but extends them to explicitly capture shared and group-specific variation across datasets.

1. Factor scores - shared between groups¶

Matrix of size (samples × factors), defined for each group (e.g. study or batch)
Represent the position of samples in the shared latent space

Interpretation:

Shared factors: comparable across groups → enable direct comparison of samples from different datasets
Group-specific factors: defined within a single group → capture within-dataset variation

2. Explained variance (per factor and per group)¶

Quantifies how much variance each factor explains
Computed:
- per group (dataset-specific contribution)
- optionally per view (cell type / modality)

Interpretation:

Shared factors → explain variance across multiple groups
Group-specific factors → explain variance restricted to one group

This allows direct assessment of whether a factor reflects reproducible biology or dataset-specific effects.

3. Factor loadings (per view) - shared between groups¶

Matrix (features × factors) for each view (e.g. cell type)
Define how features (genes) contribute to each factor

Interpretation:

Shared factors often show consistent loadings across groups
Group-specific factors may reveal dataset-specific gene programs

Loadings remain the main tool for biological interpretation of factors.

In MINA we have taken these differences into account when building the anndata use for downstream analyses

In [8]:

Copied!





amodel = mina.down.model_to_anndata(
    anndata_dict=anndata_merged, # Add the merged anndata here
    metadata=metadata_merged, # Add the merged metadata here
    model=model_group,
)
amodel = mina.down.model_to_anndata(
    anndata_dict=anndata_merged, # Add the merged anndata here
    metadata=metadata_merged, # Add the merged metadata here
    model=model_group,
)

The difference in this model is that in .var we store the explained variance per view and group separately

In [9]:

Copied!

amodel.var
amodel.var

Out[9]:

	Cycling TAs:KongChemA	Cycling TAs:KongChemB	Goblet cells:KongChemA	Goblet cells:KongChemB	Ileal enterocytes:KongChemA	Ileal enterocytes:KongChemB	Paneth cells:KongChemA	Paneth cells:KongChemB	Stem cells:KongChemA	Stem cells:KongChemB	Transit amplifying cells (TAs):KongChemA	Transit amplifying cells (TAs):KongChemB
Factor1	0.096956	0.015544	0.137926	0.038304	0.056276	0.129225	0.000000	0.424353	0.054572	0.028010	0.075502	0.026503
Factor2	0.028057	0.138526	0.042696	0.124478	0.043372	0.137565	0.024529	0.015624	0.030962	0.148181	0.031854	0.155399
Factor3	0.073853	0.098757	0.036737	0.088070	0.052247	0.060642	0.038295	0.002394	0.062813	0.094715	0.073949	0.070552
Factor4	0.182955	0.102605	0.142140	0.121865	0.050991	0.170546	0.000000	0.176521	0.124744	0.106558	0.116959	0.054711
Factor5	0.072588	0.090996	0.047768	0.073994	0.037203	0.036653	0.075044	0.000000	0.096668	0.128060	0.075884	0.106527

You can visualize this without the need of MINA using MOFA-FLEX's functions

In [10]:

Copied!

mf.pl.variance_explained(model_group)
mf.pl.variance_explained(model_group)

Out[10]:

Worth noting that this way of modeling allows to identify factors that are more representative of a group (e.g. Factor 2 for ChemB)

Doing statistics with the models¶

Given the availability of multiple groups one can do associations controling for random effects, following the same principles as in other tutorials

In [12]:

Copied!





mina.down.get_associations(amodel, 
                           test_variable = "disease",
                           test_type="categorical", 
                           random_effect = "study_group")
mina.down.get_associations(amodel, 
                           test_variable = "disease",
                           test_type="categorical", 
                           random_effect = "study_group")

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:2261: ConvergenceWarning: The Hessian matrix at the estimated parameter values is not positive definite.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:1634: UserWarning: Random effects covariance is singular
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:1634: UserWarning: Random effects covariance is singular
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:2261: ConvergenceWarning: The Hessian matrix at the estimated parameter values is not positive definite.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:1634: UserWarning: Random effects covariance is singular
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:2261: ConvergenceWarning: The Hessian matrix at the estimated parameter values is not positive definite.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:1634: UserWarning: Random effects covariance is singular
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:2237: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/statsmodels/regression/mixed_linear_model.py:2261: ConvergenceWarning: The Hessian matrix at the estimated parameter values is not positive definite.

Out[12]:

	feature	statistic	p_value	adj_p_value
0	Factor1	-0.156404	0.875715	0.875715
1	Factor2	1.785310	0.074211	0.185528
2	Factor3	1.094689	0.273653	0.342066
3	Factor4	-2.679332	0.007377	0.036885
4	Factor5	1.174447	0.240216	0.342066

We find an association with disease in Factor 4, now we can check how this factor looks for individual studies

In [13]:

Copied!

sc.pl.violin(amodel[amodel.obs["study_group"] == "KongChemA"],
             ["Factor4"], 
             groupby="disease", rotation=90,size = 3)
sc.pl.violin(amodel[amodel.obs["study_group"] == "KongChemA"],
             ["Factor4"], 
             groupby="disease", rotation=90,size = 3)

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
... storing 'donor_id' as categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
... storing 'organ' as categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
... storing 'library_preparation_protocol' as categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
... storing 'biosample_enriched_fraction' as categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
... storing 'study_group' as categorical

In [14]:

Copied!

sc.pl.violin(amodel[amodel.obs["study_group"] == "KongChemB"],
             ["Factor4"], 
             groupby="disease", rotation=90,size = 3)
sc.pl.violin(amodel[amodel.obs["study_group"] == "KongChemB"],
             ["Factor4"], 
             groupby="disease", rotation=90,size = 3)

/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
... storing 'donor_id' as categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
... storing 'organ' as categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
... storing 'library_preparation_protocol' as categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
... storing 'biosample_enriched_fraction' as categorical
/Users/flores/Dropbox/EBI/Research/MINA_docs/.venv/lib/python3.12/site-packages/anndata/_core/anndata.py:1208: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
... storing 'study_group' as categorical

Here the interpretation has to consider the explained variance per cell type and group.

In this example, Factor 4 associates with disease status regardless of the library preparation protocol. However, when examining the explained variance per group. Factor 4 captures variability in Paneth cells in one of the chemistries only. Our model is able to pick shared variance in other cells and protocols within this Factor, but one can assume that this disease signal is stronger in one cell type and chemistry.

As in the other tutorials, one can inspect the molecular functions behind the factors of interest.

In [17]:

Copied!





# Characterizing Factor Loadings
# First we need to make the gene loadings a pandas DataFrame with named columns and indexes

# Wrap as DataFrame for readability - This could be a function - users will forget to do this
gene_loadings = pd.DataFrame(amodel.varm["gene_loadings"], columns=amodel.uns['gene_loadings_columns'])
gene_loadings.index = amodel.var.index.to_list()

# Make a dictionary of gene expression data
gene_loadings = mina.down.split_by_view(gene_loadings)

# Now you can run decoupler for each matrix of gene loadings
progeny = dc.op.progeny(organism="human")
mcp_pws = mina.down.run_ulm_per_view(view_dict=gene_loadings, net=progeny)

mina.pl.plot_mcell_funcomics(mcp_pws, 
                     p_threshold=0.05,
                     top_n=15,
                     use_var=True,
                     figsize=(18, 4))
# Characterizing Factor Loadings
# First we need to make the gene loadings a pandas DataFrame with named columns and indexes

# Wrap as DataFrame for readability - This could be a function - users will forget to do this
gene_loadings = pd.DataFrame(amodel.varm["gene_loadings"], columns=amodel.uns['gene_loadings_columns'])
gene_loadings.index = amodel.var.index.to_list()

# Make a dictionary of gene expression data
gene_loadings = mina.down.split_by_view(gene_loadings)

# Now you can run decoupler for each matrix of gene loadings
progeny = dc.op.progeny(organism="human")
mcp_pws = mina.down.run_ulm_per_view(view_dict=gene_loadings, net=progeny)

mina.pl.plot_mcell_funcomics(mcp_pws, 
                     p_threshold=0.05,
                     top_n=15,
                     use_var=True,
                     figsize=(18, 4))

Running ULM for view: Transit amplifying cells (TAs)
Running ULM for view: Ileal enterocytes
Running ULM for view: Cycling TAs
Running ULM for view: Stem cells
Running ULM for view: Paneth cells
Running ULM for view: Goblet cells

/Users/flores/Dropbox/EBI/Research/MINA/src/mina/pl/pl.py:474: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.

Focusing on Factor 4, we observe inflammatory response in transit amplifying and stem cells. The difference with the single chemistry model is the genes included in the full model.

Summary¶

In this vignette, we applied group-based multicellular factor analysis to model coordinated sample-level variation across epithelial cell types in inflammatory bowel disease across different technical experimental groups.

The key difference from the introductory tutorial is the explicit modeling of shared variance across cell-types and experimental groups, which allows to identify shared and unique signals of distinct groups.

This type of modeling is the basis for using factor analysis for meta-analysis

The next vignettes focus on a different type of modeling, where we want to guide factors based on feature-level prior knowledge.