Gene Set Variation Analysis (GSVA)

Calculates regulatory activities using GSVA.

Usage

run_gsva(
  mat,
  network,
  .source = source,
  .target = target,
  verbose = FALSE,
  method = c("gsva", "plage", "ssgsea", "zscore"),
  minsize = 5L,
  maxsize = Inf,
  ...
)

Arguments

mat

Matrix to evaluate (e.g. expression matrix). Target nodes in rows and conditions in columns. rownames(mat) must have at least one intersection with the elements in network .target column.

network

Tibble or dataframe with edges and it's associated metadata.

.source

Column with source nodes.

.target

Column with target nodes.

verbose

Gives information about each calculation step. Default: FALSE.

method

Method to employ in the estimation of gene-set enrichment. scores per sample. By default this is set to gsva (Hänzelmann et al, 2013). Further available methods are "plage", "ssgsea" and "zscore". Read more in the manual of GSVA::gsva.

minsize

Integer indicating the minimum number of targets per source. Must be greater than 0.

maxsize

Integer indicating the maximum number of targets per source.

...

Arguments passed on to GSVA::gsvaParam, GSVA::ssgseaParam

assay

The name of the assay to use in case exprData is a multi-assay container, otherwise ignored. By default, the first assay is used.

annotation

The name of a Bioconductor annotation package for the gene identifiers occurring in the row names of the expression data matrix. This can be used to map gene identifiers occurring in the gene sets if those are provided in a GeneSetCollection. By default gene identifiers used in expression data matrix and gene sets are matched directly.

kcdf

Character vector of length 1 denoting the kernel to use during the non-parametric estimation of the cumulative distribution function of expression levels across samples. By default, kcdf="Gaussian" which is suitable when input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to kcdf="Poisson".

tau

Numeric vector of length 1. The exponent defining the weight of the tail in the random walk performed by the GSVA (Hänzelmann et al., 2013) method. The default value is 1 as described in the paper.

maxDiff

Logical vector of length 1 which offers two approaches to calculate the enrichment statistic (ES) from the KS random walk statistic.

FALSE: ES is calculated as the maximum distance of the random walk from 0.
TRUE (the default): ES is calculated as the magnitude difference between the largest positive and negative random walk deviations.

absRanking

Logical vector of length 1 used only when maxDiff=TRUE. When absRanking=FALSE (default) a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When absRanking=TRUE the original Kuiper statistic that sums the largest positive and negative random walk deviations, is used. In this latter case, gene sets with genes enriched on either extreme (high or low) will be regarded as ’highly’ activated.

alpha

Numeric vector of length 1. The exponent defining the weight of the tail in the random walk performed by the ssGSEA (Barbie et al., 2009) method. The default value is 0.25 as described in the paper.

normalize

Logical vector of length 1; if TRUE runs the ssGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper. Otherwise this last normalization step is skipped.

Value

A long format tibble of the enrichment scores for each source across the samples. Resulting tibble contains the following columns:

statistic: Indicates which method is associated with which score.
source: Source nodes of network.
condition: Condition representing each column of mat.
score: Regulatory activity (enrichment score).

Details

GSVA (Hänzelmann et al., 2013) starts by transforming the input molecular readouts in mat to a readout-level statistic using Gaussian kernel estimation of the cumulative density function. Then, readout-level statistics are ranked per sample and normalized to up-weight the two tails of the rank distribution. Afterwards, an enrichment score gsva is calculated using a running sum statistic that is normalized by subtracting the largest negative estimate from the largest positive one.

Hänzelmann S. et al. (2013) GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics, 14, 7.

Examples

inputs_dir <- system.file("testdata", "inputs", package = "decoupleR")

mat <- readRDS(file.path(inputs_dir, "mat.rds"))
net <- readRDS(file.path(inputs_dir, "net.rds"))

run_gsva(mat, net, minsize=1, verbose = FALSE)
#> # A tibble: 72 × 4
#>    statistic source condition score
#>    <chr>     <chr>  <chr>     <dbl>
#>  1 gsva      T1     S01       0.222
#>  2 gsva      T1     S02       0.556
#>  3 gsva      T1     S03       0.667
#>  4 gsva      T1     S04       0.778
#>  5 gsva      T1     S05       0.556
#>  6 gsva      T1     S06       0.667
#>  7 gsva      T1     S07       0.667
#>  8 gsva      T1     S08       0.667
#>  9 gsva      T1     S09       0.889
#> 10 gsva      T1     S10       0.444
#> # ℹ 62 more rows

Usage

Arguments

Value

Details

See also

Examples