mistyR and data formats

Ricardo Omar Ramirez Flores

Heidelberg University and Heidelberg University Hospital, Heidelberg, Germany

Jovan Tanevski

Heidelberg University and Heidelberg University Hospital, Heidelberg, Germany
Jožef Stefan Institute, Ljubljana, Slovenia
jovan.tanevski@uni-heidelberg.de

2024-03-25

Source: vignettes/mistyDataFormats.Rmd

mistyDataFormats.Rmd

Introduction

mistyR can be used to analyze spatial omics data sets stored in SpatialExperiment, SeuratObject or anndata object with just a couple of functions. In this vignette we demonstrate how to build a user friendly workflow starting from data preprocessing, through running mistyR, to analysis of results, focusing on functional representation of 10x Visium data.

First load packages needed for the analysis.

# MISTy
library(mistyR)
library(future)

# data manipulation
library(Matrix)
library(tibble)
library(dplyr)
library(purrr)

# normalization
library(sctransform)

# resource
library(decoupleR)

# plotting
library(ggplot2)

# setup parallel execution
plan(multisession)

As an example, we will analyze a 10X Visium spatial gene expression dataset of one breast cancer section (Block A Section 1) available here [https://support.10xgenomics.com/spatial-gene-expression/datasets]. For convenience, we make this dataset available as object in several popular formats including SpatialExperiment, SeuratObject or anndata and demonstrate how to extract the data needed for further analysis from these objects.

We will explore the spatial interactions of the Hypoxia pathway responsive genes with the Estrogen pathway responsive genes. To this end we will convert the count based representation of the data to a representation based on pathway specific genes and/or estimated pathway activities. To this end we will use the package decoupleR.

Get the data

SpatialExperiment

To load the packages needed to work with SpatialExperiment data and download the object.

# SpatialExperiment
library(SpatialExperiment)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'matrixStats'
#> The following object is masked from 'package:dplyr':
#> 
#>     count
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:dplyr':
#> 
#>     combine, intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#>     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#>     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
#>     table, tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:dplyr':
#> 
#>     first, rename
#> The following objects are masked from 'package:Matrix':
#> 
#>     expand, unname
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> Loading required package: IRanges
#> 
#> Attaching package: 'IRanges'
#> The following object is masked from 'package:purrr':
#> 
#>     reduce
#> The following objects are masked from 'package:dplyr':
#> 
#>     collapse, desc, slice
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
library(SingleCellExperiment)
library(SummarizedExperiment)


download.file("https://www.dropbox.com/scl/fi/7mdmz6vk10ib55qn7w3fw/visium_spe.rds?rlkey=j03qamdm9zcin577chlabms0m&dl=1",
    destfile = "visium_spe.rds", mode ="wb", quiet = TRUE
  )

Next, read the object and extract the expression and location data.

spe.vs <- readRDS("visium_spe.rds")


# Expression data
expression <- counts(spe.vs)

# deal with duplicate names
symbols <- rowData(spe.vs)$symbol
d.index <- which(duplicated(symbols))
symbols[d.index] <- paste0(symbols[d.index],".1")
rownames(expression) <- symbols

# Location data
geometry <- as.data.frame(colData(spe.vs)) %>%
  select(array_row, array_col)

colnames(geometry) <- c("row", "col")

Seurat

To load the packages needed to work with SeuratObject data and download the object.

# Seurat
library(Seurat)

download.file("https://www.dropbox.com/scl/fi/44zf4le1xcq7ichjp11bg/visium_seurat.rds?rlkey=ikrrsp2rncqde0nnsbdte1joa&dl=1",
    destfile = "visium_seurat.rds", mode ="wb", quiet = TRUE
  )

Next, read the object and extract the expression and location data.

seurat.vs <- readRDS("visium_seurat.rds")


# Expression data
expression <- GetAssayData(
  object = seurat.vs,
  slot = "counts",
  assay = "Spatial"
)

# Seurat deals with duplicates internally in similar way as above

# Location data
geometry <- GetTissueCoordinates(seurat.vs,
    cols = c("row", "col"), scale = NULL
  )

AnnData

To load the packages needed to work with SeuratObject data and download the object.

# AnnData
library(anndata)

download.file("https://www.dropbox.com/scl/fi/jubijl0pr8rhka8mfjpcn/visium_anndata.h5ad?rlkey=xmhmfl5oz61dgngmackkklp32&dl=1",
    destfile = "visium_anndata.h5ad", mode ="wb", quiet = TRUE
  )

Next, read the object and extract the expression and location data.

anndata.vs <- read_h5ad("visium_anndata.h5ad")


# Expression data
# Here the dgRMatrix is converted to a dense matrix for vst compatibility reasons
expression <- t(as.matrix(anndata.vs$X))

# deal with duplicate names (alternatively see AnnData.var_names_make_unique)
symbols <- rownames(expression)
d.index <- which(duplicated(symbols))
symbols[d.index] <- paste0(symbols[d.index],".1")
rownames(expression) <- symbols

# Location data
geometry <- anndata.vs$obs[,c("array_row", "array_col")]
colnames(geometry) <- c("row", "col")

Preprocess the data

From here on, the analysis proceeds in the same way no matter the input format.

Normalize counts and drop duplicates

In this example we normalize the counts using vst normalization. However, the user must define what’s the best solution for their analysis.

norm.data <- vst(as(expression, "dgCMatrix"), verbosity = 0)$y

Filtering genes that are expressed in at least 5% of the spots

coverage <- rowSums(norm.data > 0) / ncol(norm.data)
slide.markers <- names(coverage[coverage >= 0.05])

Run MISTy on pathway specific genes

In this use casewe would like to dissect the relationships between Estrogen and Hypoxia responsive genes coming from two spatial contexts: Relationships within the spot and relationships in the broader tissue structure. In particular, as intrinsic representation we will use the normalized counts of Hypoxia responsive genes. We will explore the relationships between the Hypoxia responsive genes at the context of a Visium spot, the relationships between Hypoxia responsive genes in the broader tissue structure and the relationships between the Estrogen responsive genes and Hypoxia responsive genes in the broader tissue structure.

Defining Estrogen and Hypoxia responsive genes

For this simple example we will pick the top 15 most significantly responsive genes of each pathway from the model matrix from the resource progeny available from the package decoupleR.

resource <- get_progeny(organism ="human", top = 15)

estrogen.footprints <- resource %>%
  filter(source == "Estrogen", weight != 0, target %in% slide.markers) %>% 
  pull(target)

hypoxia.footprints <- resource %>%
  filter(source == "Hypoxia", weight != 0, target %in% slide.markers) %>% 
  pull(target)

View composition

To capture the relationships of interest within the descrbed contexts, our MISTy model will consist of three views. First we construct a Hypoxia specific intraview (capturing the expression within a spot) and add a Hypoxia specific paraview (capturing the expression in the broader tissue structure) with a significance radius of 5 spots.

hypoxia.views <- create_initial_view(t(norm.data)[, hypoxia.footprints] %>% as_tibble()) %>%
  add_paraview(geometry, l=5)
#> 
#> Generating paraview

We will next create a similar view composition but using Estrogen reponsive genes. This is an easy way to generate a view that will capture the expression of Estrogen responsive genes in the broader tissue structure that we will add to the previously generated view composition.

estrogen.views <- create_initial_view(t(norm.data)[,estrogen.footprints] %>% as_tibble()) %>%
  add_paraview(geometry, l=5)
#> 
#> Generating paraview

We next combine the view composition in a composition capturing all relationships of interest: Hypoxia intraview + Hypoxia paraview + Estrogen paraview.

misty.views <- hypoxia.views %>% add_views(create_view("paraview.estrogen.5", estrogen.views[["paraview.5"]]$data, "para.estrogen.5"))

Run MISTy and collect results

run_misty(misty.views, "vignette_model_footprints")
#> [1] "/home/runner/work/mistyR/mistyR/vignettes/vignette_model_footprints"

misty.results <- collect_results("vignette_model_footprints")

Interpretation and downstream analysis

MISTy gives explanatory answers to three general questions:

1. How much can the broader spatial context explain the expression of markers (in contrast to the intraview)?

This can be observed in the gain in R2 (or RMSE) of using the multiview model in contrast to the single main view model.

misty.results %>%
  plot_improvement_stats("gain.R2") %>%
  plot_improvement_stats("gain.RMSE")

In this example, PGK1 is a marker whose expression can be explained better by modeling the broader spatial context around each spot.

We can further inspect the significance of the gain in variance explained, by the assigned p-value of improvement based on cross-validation.

misty.results$improvements %>%
  filter(measure == "p.R2") %>%
  arrange(value)
#> # A tibble: 15 × 4
#>    target  sample                                                measure   value
#>    <chr>   <chr>                                                 <chr>     <dbl>
#>  1 PGK1    /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    4.57e-4
#>  2 ANKZF1  /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    8.78e-4
#>  3 PDK1    /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    3.14e-3
#>  4 FUT11   /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    1.56e-2
#>  5 FAM162A /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    3.11e-2
#>  6 HILPDA  /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    3.24e-2
#>  7 INSIG2  /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    7.65e-2
#>  8 ENO2    /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    1.09e-1
#>  9 NDRG1   /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    1.11e-1
#> 10 EGLN1   /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    1.26e-1
#> 11 PFKFB4  /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    1.79e-1
#> 12 BNIP3L  /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    2.00e-1
#> 13 KDM3A   /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    2.25e-1
#> 14 ANKRD37 /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    2.82e-1
#> 15 GBE1    /home/runner/work/mistyR/mistyR/vignettes/vignette_m… p.R2    8.57e-1

In general, the significant gain in R2 can be interpreted as the following:

“We can better explain the expression of marker X, when we consider additional views, other than the intrinsic view.”

2.How much do different view components contribute to explaining the expression?

misty.results %>% plot_view_contributions()


misty.results$contributions.stats %>% filter(target == "PGK1")
#> # A tibble: 3 × 6
#>   target view             mean fraction p.mean  p.sd
#>   <chr>  <chr>           <dbl>    <dbl>  <dbl> <dbl>
#> 1 PGK1   intra           0.513    0.375      0    NA
#> 2 PGK1   para.5          0.325    0.237      0    NA
#> 3 PGK1   para.estrogen.5 0.531    0.388      0    NA

In the case of PGK1, we observe that around 37.7% of the contribution in the final model comes from the expression of other markers of hypoxia intrinsically or from the broader tissue structure. The rest (62.3%) comes from the expression of estrogen and hypoxia responsive genes from the broader tissue structure.

3.What are the specific relations that can explain the contributions?

To explain the contributions, we can visualize the importances of markers coming from each view separately as predictors of the expression of the intrinsic markers of hypoxia.

First, the intrinsic importances of the hypoxia markers.

misty.results %>% plot_interaction_heatmap(view = "intra")

These importances are associated to the relationship between markers in the same spot. Let’s pick the best predictor of PGK1 to confirm this:

misty.results$importances.aggregated %>% 
  filter(view == "intra", Target == "PGK1") %>%
  arrange(-Importance)
#> # A tibble: 15 × 5
#>    view  Predictor Target Importance nsamples
#>    <chr> <chr>     <chr>       <dbl>    <int>
#>  1 intra NDRG1     PGK1       2.45          1
#>  2 intra FAM162A   PGK1       1.97          1
#>  3 intra HILPDA    PGK1       0.345         1
#>  4 intra PFKFB4    PGK1       0.203         1
#>  5 intra ENO2      PGK1       0.0588        1
#>  6 intra EGLN1     PGK1      -0.339         1
#>  7 intra KDM3A     PGK1      -0.483         1
#>  8 intra GBE1      PGK1      -0.486         1
#>  9 intra ANKRD37   PGK1      -0.528         1
#> 10 intra ANKZF1    PGK1      -0.565         1
#> 11 intra BNIP3L    PGK1      -0.579         1
#> 12 intra FUT11     PGK1      -0.646         1
#> 13 intra INSIG2    PGK1      -0.674         1
#> 14 intra PDK1      PGK1      -0.733         1
#> 15 intra PGK1      PGK1      NA             1


vis.data <- cbind(geometry, t(norm.data)[,hypoxia.footprints], t(norm.data)[,estrogen.footprints])

ggplot(vis.data, aes(x=col, y=row, color = PGK1)) + geom_point() + theme_void()

ggplot(vis.data, aes(x=col, y=row, color = NDRG1)) + geom_point() + theme_void()

Second, the paraview importances of the hypoxia markers.

misty.results %>% plot_interaction_heatmap(view = "para.5")

These importances are associated to the relationship between markers in the spot and markers in the neighborhood (controlled by our parameter l).

ggplot(vis.data, aes(x=col, y=row, color = PGK1)) + geom_point() + theme_void()

ggplot(vis.data, aes(x=col, y=row, color = EGLN1)) + geom_point() + theme_void()

As expected, the expression of EGLN1 (a predictor with hign importance from this view) in the neighborhood of each spot allows to explain the expression of PGK1.

Finally, the paraview importances of the estrogen markers. We will inspect the best predictor in this view.

misty.results %>% plot_interaction_heatmap(view = "para.estrogen.5")

ggplot(vis.data, aes(x=col, y=row, color = PGK1)) + geom_point() + theme_void()

ggplot(vis.data, aes(x=col, y=row, color = TPD52L1)) + geom_point() + theme_void()

It is visible that in some areas the local expression of TPD52L1 overlaps with the areas with the highest expression of PGK1.

Important notes

The relationships captured in the importances are not to assumed or interpreted as linear or casual.
1-to-1 importances between predictor and markers should always be interpreted in the context of the other predictors, since training MISTy models is multivariate predictive task.

Other use cases

The shown example is not the only way to use mistyR to analyze spatial transcriptomics data. Similar and complementary workflows can be constructed to describe different aspects of biology, for example:

Spatial interactions between pathway activities and putative ligands, as shown here.
Spatial interactions between cell-state lineage markers and putative ligands, as shown here.
Spatial interactions between cell-type abundances leveraging deconvolution methods and creating descriptions of cell colocalization and tissue architecture.

Additionally, mistyR through the function collect_results() allows you to group the results of multiple slides, allowing for a more robust, integrative or comparative analysis of spatial interactions.

Session info

Here is the output of sessionInfo() at the point when this document was compiled:

#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SpatialExperiment_1.12.0    SingleCellExperiment_1.24.0
#>  [3] SummarizedExperiment_1.32.0 Biobase_2.62.0             
#>  [5] GenomicRanges_1.54.1        GenomeInfoDb_1.38.8        
#>  [7] IRanges_2.36.0              S4Vectors_0.40.2           
#>  [9] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
#> [11] matrixStats_1.2.0           ggplot2_3.5.0              
#> [13] decoupleR_2.8.0             sctransform_0.4.1          
#> [15] purrr_1.0.2                 dplyr_1.1.4                
#> [17] tibble_3.2.1                Matrix_1.6-5               
#> [19] future_1.33.1               mistyR_1.10.0              
#> [21] BiocStyle_2.30.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3      jsonlite_1.8.8          magrittr_2.0.3         
#>   [4] magick_2.8.3            farver_2.1.1            rmarkdown_2.26         
#>   [7] fs_1.6.3                zlibbioc_1.48.2         ragg_1.3.0             
#>  [10] vctrs_0.6.5             memoise_2.0.1           RCurl_1.98-1.14        
#>  [13] htmltools_0.5.7         S4Arrays_1.2.1          progress_1.2.3         
#>  [16] curl_5.2.1              cellranger_1.1.0        SparseArray_1.2.4      
#>  [19] sass_0.4.9              parallelly_1.37.1       bslib_0.6.2            
#>  [22] desc_1.4.3              plyr_1.8.9              lubridate_1.9.3        
#>  [25] cachem_1.0.8            igraph_2.0.3            lifecycle_1.0.4        
#>  [28] pkgconfig_2.0.3         R6_2.5.1                fastmap_1.1.1          
#>  [31] GenomeInfoDbData_1.2.11 digest_0.6.35           selectr_0.4-2          
#>  [34] colorspace_2.1-0        furrr_0.3.1             textshaping_0.3.7      
#>  [37] filelock_1.0.3          labeling_0.4.3          fansi_1.0.6            
#>  [40] timechange_0.3.0        httr_1.4.7              abind_1.4-5            
#>  [43] compiler_4.3.3          bit64_4.0.5             withr_3.0.0            
#>  [46] backports_1.4.1         logger_0.3.0            OmnipathR_3.10.1       
#>  [49] highr_0.10              R.utils_2.12.3          MASS_7.3-60.0.1        
#>  [52] rappdirs_0.3.3          DelayedArray_0.28.0     rjson_0.2.21           
#>  [55] tools_4.3.3             future.apply_1.11.1     R.oo_1.26.0            
#>  [58] glue_1.7.0              grid_4.3.3              checkmate_2.3.1        
#>  [61] reshape2_1.4.4          generics_0.1.3          gtable_0.3.4           
#>  [64] tzdb_0.4.0              R.methodsS3_1.8.2       tidyr_1.3.1            
#>  [67] data.table_1.15.2       hms_1.1.3               xml2_1.3.6             
#>  [70] utf8_1.2.4              XVector_0.42.0          pillar_1.9.0           
#>  [73] stringr_1.5.1           vroom_1.6.5             later_1.3.2            
#>  [76] lattice_0.22-5          bit_4.0.5               tidyselect_1.2.1       
#>  [79] knitr_1.45              gridExtra_2.3           bookdown_0.38          
#>  [82] xfun_0.42               stringi_1.8.3           yaml_2.3.8             
#>  [85] evaluate_0.23           codetools_0.2-19        BiocManager_1.30.22    
#>  [88] cli_3.6.2               systemfonts_1.0.6       munsell_0.5.0          
#>  [91] jquerylib_0.1.4         Rcpp_1.0.12             readxl_1.4.3           
#>  [94] globals_0.16.3          parallel_4.3.3          pkgdown_2.0.7          
#>  [97] readr_2.1.5             assertthat_0.2.1        prettyunits_1.2.0      
#> [100] bitops_1.0-7            listenv_0.9.1           rlist_0.4.6.2          
#> [103] scales_1.3.0            crayon_1.5.2            rlang_1.1.3            
#> [106] rvest_1.0.4             distances_0.1.10

mistyR and data formats

Ricardo Omar Ramirez Flores

Jovan Tanevski

2024-03-25

Introduction

Get the data

SpatialExperiment

Seurat

AnnData

Preprocess the data

Normalize counts and drop duplicates

Filtering genes that are expressed in at least 5% of the spots

Run MISTy on pathway specific genes

Defining Estrogen and Hypoxia responsive genes

View composition

Run MISTy and collect results

Interpretation and downstream analysis

Important notes

Other use cases

See also

More examples

Publication

Session info