Network Analysis practical session - Day 1

Martín Garrido Rodríguez-Córdoba, Charlotte Boys, Pau Badia-i-Mompel, Aurelien Dugourd

17-01-2022

About

This vignette contains the material for the first practical session of the Network Analysis course. It was developed by the Saez-Rodriguez group members. The estimated time to go over the content of the tutorial is of ~ 3 hours. This is the schedule for the practical session:

  • Introduction, setup and troubleshooting: 20 minutes
  • Bulk RNA-Seq data quality control and normalization: 30 minutes
  • Differential expression analysis: 30 minutes
  • Break: 10 minutes
  • Functional analysis with MSigDB and decoupleR: 30 minutes
  • Footprint-based analysis with DoRothEA and PROGENy: 45 minutes
  • Conclusions and preparation for day 2: 15 minutes

If you have any question, please contact me at martin.garrido@uni-heidelberg.de. Feedback about the tutorial is also highly appreciated.

Before you start

This tutorial is intended to guide users through the statistical and functional analysis of transcriptomic data. We assume the following skills in the audience:

  • Basic knowledge of git and GitHub.
  • Basic knowledge of the statistical programming language R.
  • Basic knowledge about how to load and make use of external R packages, such as those included in the tidyverse and Bioconductor packages.
  • Be familiar with concepts like omics data, biological databases, exploratory data analysis, hypothesis testing and functional analysis.

This tutorial requires:

  • R >= 4.1.2 You can download an install R from this link.
  • The following R packages are also required:
    • tidyverse
    • edgeR
    • limma
    • msigdbr
    • decoupleR >= 2.0.0
    • progeny
    • dorothea
  • Rstudio is highly recommended to open, run and modify the R code that we will use in this tutorial. You can download and install Rstudio from this link. Please install RStudio Desktop, Open Source Edition, which is free thanks to its Open Source License.

Download and setup instructions

  • Please clone the content of the following repository in your computer: https://github.com/saezlab/NetworkModeling_course_2022
  • If using Rstudio, please double-click the file practical_daty_1/practical_day_1.Rproj
  • The following code chunk takes care of checking and installing those packages in your R installation. Please run it from your computer:
# CRAN packages
cran_packages <- c("tidyverse", "msigdbr", "BiocManager")
for (i in cran_packages) {
  if (!require(i, character.only = TRUE))
    install.packages(i)
}

# BioC packages
bioc_packages <-
  c("decoupleR", "progeny", "dorothea", "limma", "edgeR", "decoupleR")
for (i in bioc_packages) {
  if (!require(i, character.only = TRUE))
    BiocManager::install(i, update = FALSE)
} 

The initial part of this tutorial is based in the work from Law et al. 2016. We encourage users to check this nice article which describes how to analyze RNA-Seq data using limma and edgeR. Similarly, users can also take a look to this tutorial, which offers an alternative pipeline based in the DESeq2 package.

The biology of the tutorial

In this tutorial we will work with a subset of the transcriptomics data available in the GEO entry GSE173201. This entry contains the data accompanying the publication entitled: “Genomic Characterization of Cisplatin Response Uncovers Priming of Cisplatin-Induced Genes in a Resistant Cell Line”. Together with debulking surgery, cisplatin-based chemotherapy constitutes a first line treatment for ovarian cancer patient. Unfortunately, many of them develop tumors which are resistant to this chemotherapy, which results in a 5-year survival rate below 50%. For simplicity, in this tutorial we compare the basal transcriptomic profiles of two cell lines: A2780 and A2780cis. A2780 is a cell line that was established from an Ovarian endometroid adenocarcinoma tumour in an untreated patient. The A2780cis cell line was created after chronically exposing A2780 cells to increasing concentrations of cisplatin to create a cisplatin-resistant cell line.

RNA-Seq

The authors of the study employed RNA sequencing (RNA-Seq) to obtain the basal transcriptomic profiles of both A2780 and A2780cis. RNA-Seq is a technology based on Next Generation Sequencing (NGS) that allows researchers to massively measure the abundance of the RNA molecules in a sample. Starting from the biological sample of interest (e.g. the cell culture), the common steps to perform bulk RNA-Sequencing include: 1) Total RNA extraction and target enrichment, 2) Fragmentation and reverse transcription and amplification (a.k.a. library preparation) and 3) Sequencing of the library (using, for example, Illumina’s NextSeq 500). The output of this process consists of thousands of sequences (a.k.a. reads) which are then aligned to a reference genome/transcriptome to obtain a “count” matrix. This count matrix reflects the number of reads that align in a set of genomic regions of interest, and it represents the level to which that region is being expressed in a given sample. Those regions usually comprise genes, but they can also include other types of transcribed DNA regions, such as microRNAs (miRNAs) or long non-coding RNAs (lncRNAs). This gene expression data is high-throughput, covering on the scale of 104 genes, and gives a detailed snapshot of the transcriptional activity of the cell sample under a given experimental condition at a moment in time.

Traditional RNA-Seq protocols measures the RNA samples obtained from a cell population that can be more or less heterogeneous (bulk). In contrast to them, novel techniques enable researchers to obtain transcriptomic profiles that are resolved at the single-cell level or that are spatially resolved. Here we will focus on bulk RNA-Seq data. For more information about single-cell and spatial transcriptomic technologies, we recommend users to read the comprehensive review written by Stark et al.

Loading packages and data in R

We first load the packages that we will employ during the tutorial

library(tidyverse)
library(edgeR)
library(limma)
library(msigdbr)
library(decoupleR)
library(progeny)
library(dorothea)

Next, we read the data tables that contain the count matrix and samples’ metadata. Those tables are stored in the format .tsv, which stands for “tab-separated values”. A tab-separated values file is a simple text format for storing data in a tabular structure

counts <-
  read.table(
    "data/counts.tsv",
    header = TRUE,
    sep = "\t",
    row.names = 1
  )
metadata <- readr::read_tsv("data/metadata.tsv")

NOTE: In the second line, we employ different functions to read the tables. The first one, read.table(), is a base R function. The second one, read_tsv(), belongs to the readr package, which is part of the tidyverse collection. We use the :: nomenclature to indicate the package that provides the function. This is a good R coding practice, because it prevents the conflicts that may appear as a product of common function names between packages. However, while using such nomenclature is recommended, it is not mandatory when the name of functions is less likely to cause conflicts.

Next, we can explore how the first rows of the count matrix look like. In this matrix, rows correspond to genes (here, labelled by their HUGO gene symbol) and columns represent samples.

head(counts)
##          A2780_rep1 A2780_rep2 A2780_rep3 Acis_rep1 Acis_rep2 Acis_rep3
## TSPAN6          471        477        426       104       112        84
## TNMD             15         15          7         0         0         0
## DPM1            322        299        279       232       242       224
## SCYL3           189        213        162        97       126       115
## C1orf112        388        389        365       325       318       259
## CFH            1860       1822       1710      1983      2066      1945

We can also explore the content of the metadata table, which contain the properties/annotations for each sample. This table may include information like known batch effects or sample pairs. Very often, this information is not explicitly provided but can be inferred from the sample name: e.g. PATIENT1_TUMOR_BATCH1 vs PATIENT2_NORMAL_BATCH2.

metadata
## # A tibble: 6 × 2
##   id         group    
##   <chr>      <chr>    
## 1 A2780_rep1 sensitive
## 2 A2780_rep2 sensitive
## 3 A2780_rep3 sensitive
## 4 Acis_rep1  resistant
## 5 Acis_rep2  resistant
## 6 Acis_rep3  resistant

Quality control and pre-processing

Checking library size

Before being able to compare data from both cell lines, we should perform basic quality control analyses and some pre-processing of the count matrix. In a first step, we can take a look to the total number of sequences per sample, which is also known as library size. This provides information about the sequencing depth and is a good indicator of differences or batch effects in the sequencing process.

data.frame(sample = colnames(counts), n_counts = colSums(counts)) %>%
  ggplot2::ggplot(aes(x = sample, y = n_counts)) +
  ggplot2::geom_col() +
  ggplot2::scale_y_continuous()

NOTE: Inside this code chunk, we employ several functions. First, we create a data frame with the id of the samples and the total number of counts for each of them using the data.frame() and the colSums() functions. Next, we pass this data frame object to the ggplot() function to start creating the plot. This is done thanks to the pipe operand %>%, which “sends” the data frame to the function positioned after it. For more information about the pipe operand please see this documentation page.

NOTE: The ggplot2 package, which is part of the tidyverse collection, is here employed to create the plot. ggplot2 comprises a powerful yet simple framework to create and edit high-quality graphics. For more information, please see ggplot2 homepage.

As it can be observed in the plot, the library sizes range from 4 to 5 millions of sequences. This means that the ratio between the largest and the smallest libraries is of ~ 1.25. This is an acceptable value for most statistical approaches. When this ratio is higher than ~3, ad-hoc adjustments should be made to consider the heterogeneity in library sizes. We will talk more about this in the normalization section of the tutorial.

Filtering lowly expressed genes

Next, the count matrix data can be filtered to remove genes which are lowly expressed across conditions or not expressed at all. The reasons for this are biological as well as statistical. Firstly, genes which are expressed at low levels across the different samples are likely to arise from noise in the sequencing process, or are otherwise not likely to be biologically meaningful and are therefore best ignored in the downstream analysis. Secondly, removing genes with low counts allows the mean-variance relationship to be more reliably estimated and reduces the number of statistical tests performed during differential analysis.

Here, to remove lowly expressed genes we will make use of the function filterByExpr() function from the edgeR package. From Law et al. 2016: “By default, the function keeps genes with about 10 read counts or more in a minimum number of samples, where the number of samples is chosen according to the minimum group sample size. The actual filtering uses CPM values rather than counts in order to avoid giving preference to samples with large library sizes”.

min_counts <- 10

# apply fitlerByExpr function to retrieve a boolean vector indicating which genes should be retained
design_matrix <- model.matrix( ~ 0 + group, data = metadata)
keep <- edgeR::filterByExpr(counts, design = design_matrix, min.count = min_counts)

# use the boolean vector to filter the count matrix
counts_filtered <- counts[keep,]

# print a message to show the number of genes that are retrieved after filtering
message(
  "From ",
  nrow(counts),
  " genes, ",
  sum(!keep),
  " using filterByExpr(). The resulting count matrix contains ",
  nrow(counts_filtered),
  " genes."
)
## From 18630 genes, 5240 using filterByExpr(). The resulting count matrix contains 13390 genes.

NOTE: We created an object called design_matrix to apply the filterByExpr() function. This design matrix encodes the experimental group to which each sample belongs in the form of zero and ones and will be used later in the differential expression analysis.

NOTE: Explore the content of the design_matrix object

QUESTION: What happens when you change the min_counts argument? Why is 10 a good value?

In addition, we can also visualize the per-sample count distribution before and after filtering lowly expressed genes. To this end, we first create a helper function that generates the desired plot

# in order to avoid redundancy, we create a helper function here
plot_count_distribution <- function(c) {
  
  p <- c %>%
    tibble::rownames_to_column(var = "gene_symbol") %>%
    tidyr::pivot_longer(-gene_symbol, names_to = "sample", values_to = "n_counts") %>%
    ggplot2::ggplot(aes(x = log2(n_counts + 1), fill = sample)) +
    ggplot2::geom_histogram(binwidth = 0.1) +
    ggplot2::facet_grid(rows = vars(sample))
  
  return(p)
  
}

And then apply it to both unfiltered counts

plot_count_distribution(counts) + 
  ggplot2::ggtitle("Not filtered")

And filtered counts

plot_count_distribution(counts_filtered) + 
  ggplot2::ggtitle("Filtered")

To aid visualization, we log transformed raw counts. As it can be observed, the low-counts region of the plot becomes less prominent after the filtering.

NOTE: Go to the chunk of code where we create the plot_count_distribution() and substitute the line geom_histogram(binwidth = 0.1) + with geom_density() +. Then run again the chunks that create the plots. What happens? Run help(density) in your terminal to get more information about kernel density estimation.

Normalization

There are multiple factors that can result in libraries of different sizes. Those include experimental variations, batch effects or simply, different sequencing depths. We assume that, if it were not for these variations, all samples should have a similar range and distribution of expression. Therefore, after data filtering, a normalization step is necessary to ensure that gene counts can be compared between samples and experimental conditions. Below, we use the calcNormFactors() function from edgeR to calculate scaling factors for each sample. It uses the approach described in Robinson et al. 2010, entitled “trimmed mean of M values (TMM)”.

# create DGEList object
dge <-
  edgeR::DGEList(counts = counts_filtered, group = metadata$group)

# apply normalization function
dge_norm <- edgeR::calcNormFactors(dge, method = "TMM")

We can now use the norm.factors to generate a matrix of normalized gene expression values. To do so, we employ using the counts per million (CPM) approach that can be applied using the cpm() function from edgeR. In addition, we apply a \(\log_2\) transformation in preparation for the differential expression analysis

norm_expr <- edgeR::cpm(dge_norm, log = TRUE)

And have a look to how the normalized values look like:

head(norm_expr)
##          A2780_rep1 A2780_rep2 A2780_rep3 Acis_rep1 Acis_rep2 Acis_rep3
## TSPAN6     6.573658   6.634064   6.616629  4.593075  4.691209  4.384847
## DPM1       6.028097   5.964049   6.009471  5.735918  5.789353  5.780618
## SCYL3      5.266304   5.478889   5.232354  4.494460  4.858380  4.829742
## C1orf112   6.295424   6.341301   6.394761  6.218794  6.180606  5.988498
## CFH        8.550138   8.562763   8.616796  8.820767  8.872866  8.888483
## FUCA2      5.235803   5.338122   5.334274  4.760052  4.789093  4.673371

QUESTION: Using TMM + CPMs, can we compare the normalized expression values between genes?

Differential Expression Analysis

Next, we can analyze the differences between the transcriptomic profiles of both cell lines. To do so, we carry out what is known as Differential Expression Analysis (DEA). This is a statistical analysis where all genes in the count matrix are compared between the sample groups of interest. To perform DEA, we use the limma package. limma was originally released as a software package to perform the statistical analysis of microarray data, but was later adapted for the analysis of RNA-Seq. limma fits a linear model for each gene according to the experimental design to test the null hypothesis that no gene is differentially expressed between experimental conditions, and calculates a moderated t-statistic. To do so, it uses an empirical Bayes method to improve estimation of the underlying variance. For more information, please see “Shared global parameters link gene-wise models” and “Empirical Bayes borrows information between genes” of the aforementioned paper. limma is not the only option to perform DEA. Common alternatives include edgeR and DESeq2. The strategy that we employ here is known as limma-trend, where the initial count matrix is normalized using edgeR but the DEA is carried out with limma. For more information please see Limma’s users guide.

To carry out the DEA with limma, we employ the lmFit(), contrasts.fit(), eBayes() and topTable() functions using as input the normalized expression matrix

# create the contrast matrix
contrast_matrix <-
  limma::makeContrasts(contrasts = c("groupresistant-groupsensitive"),
                       levels = design_matrix)

# create linear model
fit <- limma::lmFit(norm_expr, design = design_matrix)

# fit contrasts
# "The coefficients, unscaled standard deviations and correlation matrix are re-calculated in terms of the contrasts."
fit_contrasts <-
  limma::contrasts.fit(fit, contrasts = contrast_matrix)

# compute moderated statistics using the global parameteres
ebayes_fit <- limma::eBayes(fit_contrasts)

# retrieve table of differentially expressed genes
de_table <- limma::topTable(ebayes_fit, number = Inf) %>%
  tibble::rownames_to_column(var = "gene")

The main product of this analysis is the differential expression table (de_table). Lets look at its content:

head(de_table)
##       gene     logFC  AveExpr         t      P.Value    adj.P.Val        B
## 1 DLX6-AS1 -9.973853 3.813716 -262.3526 2.673925e-13 2.700734e-09 18.82640
## 2    MEF2C -8.970600 3.312089 -231.6423 5.601056e-13 2.700734e-09 18.59825
## 3     PAX2 -9.018390 3.335984 -228.6486 6.050935e-13 2.700734e-09 18.57195
## 4     EYA4 -8.577307 3.115443 -214.1948 8.917741e-13 2.985214e-09 18.43243
## 5  NAALAD2 -8.222919 2.938249 -198.2238 1.412934e-12 3.342698e-09 18.25008
## 6   COL1A2 -8.581795 3.117687 -191.3908 1.740203e-12 3.342698e-09 18.16130

The documentation for the limma function topTable() gives the following information about the statistics in the data frame:

  • gene: The gene ID
  • logFC: Estimate of the log2-fold-change corresponding to the effect or contrast
  • AveExpr: Average log2-expression for the probe over all arrays and channels
  • t: Moderated t-statistic (omitted for topTableF)
  • P.Value: Raw p-value
  • adj.P.Val: Adjusted p-value or q-value
  • B: Log-odds that the gene is differentially expressed (omitted for topTreat)

For more information about limma’s output, please see this nice post from Gordon Smyth in bioconductor forums,

We can also produce a short summary of the resulting data frame using the base R summary() function:

summary(de_table)
##      gene               logFC               AveExpr              t            
##  Length:13390       Min.   :-10.214733   Min.   : 0.2024   Min.   :-262.3526  
##  Class :character   1st Qu.: -0.329065   1st Qu.: 3.6897   1st Qu.:  -4.0725  
##  Mode  :character   Median : -0.007562   Median : 5.5092   Median :  -0.1098  
##                     Mean   : -0.157601   Mean   : 5.1253   Mean   :  -1.4930  
##                     3rd Qu.:  0.241790   3rd Qu.: 6.6562   3rd Qu.:   3.4227  
##                     Max.   :  8.552186   Max.   :10.1200   Max.   : 142.4021  
##     P.Value            adj.P.Val              B          
##  Min.   :0.0000000   Min.   :0.000000   Min.   :-8.6989  
##  1st Qu.:0.0002445   1st Qu.:0.000978   1st Qu.:-7.5038  
##  Median :0.0104191   Median :0.020837   Median :-4.5690  
##  Mean   :0.1456324   Mean   :0.167759   Mean   :-3.1538  
##  3rd Qu.:0.1695559   3rd Qu.:0.226069   3rd Qu.:-0.3057  
##  Max.   :0.9997481   Max.   :0.999748   Max.   :18.8264

And look at the distribution of the adjusted P values:

ggplot2::ggplot(de_table, aes(x = adj.P.Val)) +
  ggplot2::geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

NOTE: Do you remember how the P-value distribution should look like? If not, please see here.

Our P-value histogram clearly shows an anti-conservative distribution. We can also visualize the gene expression changes using one of the most common plots to explore DE results, the volcano plot:

# set cutoffsa
p_cutoff <- 0.0001
fc_cutoff <- 2

# annotate de_table with status according to cutoffs
de_table <- de_table %>%
  dplyr::mutate(
    status = dplyr::case_when(
      logFC >= fc_cutoff & adj.P.Val <= p_cutoff ~ "Up",
      logFC <= -fc_cutoff & adj.P.Val <= p_cutoff ~ "Down",
      TRUE ~ "Other"
    )
  )

#  volcano plots
ggplot2::ggplot(de_table, aes(x = logFC,-log10(adj.P.Val), color = status)) +
  ggplot2::geom_point() +
  ggplot2::geom_vline(xintercept = log2(c(0.25, 0.5, 2, 4)), lty = 2) +
  ggplot2::geom_hline(yintercept = -log10(c(0.05, 0.01, 0.001, 0.0001)), lty = 2) +
  ggplot2::scale_color_manual(values = c(
    "Up" = "red",
    "Down" = "blue",
    "Other" = "black"
  ))

A volcano plot enables us to quickly visualize the magnitude (logFC) and significance (-log10(pvalue)) of DE changes. Each point represent a gene, and its color indicates whether they surpass or not a cutoff of an absolute logFC > 2 and an adjusted P value < 10^{-4}.

QUESTION: What do the horizontal and vertical dashed lines mean in this plot? Take a look at the code and think about their meaning.

QUESTION: Try to modify the fc_cutoff and and p_cutoff variables in this chunk. What happens? What cutoff values should we use?

Functional analysis

Thanks to the DEA, we know which genes are different between the two cell lines, with measurements that indicate us the magnitude and the significance of the changes. But now we face one of the most prominent bottlenecks in the analysis of omics data: The extraction of biological insights that we can further interpret and link to a particular phenotype. A manual exploration of the genes is not an option, given that our dataset comprises 13390 genes, from which 994 surpassed the previously established cutoffs. Hence, we need an approach that enable us to reduce the number of features to interpret, and that help us to link observation from our dataset with previous knowledge. Welcome to the functional analysis section of the tutorial.

In the next steps, we employ prior biological knowledge retrieved from multiple biological resources. To perform the different analyses we employ the decoupleR framework, a recent development from the Saez-Rodriguez group which offers 12 different statistical approaches to perform the functional analysis of omics data.

To perform the functional analysis with decoupleR, we need to prepare the results from our DEA in the appropriate format. As decoupleR functions expect a matrix as input, we are going to create a one-column matrix from the de_table object. This one column matrix contains the vector of T values obtained from limma. We choose the T values over other metrics because it combines the significance and the magnitude of the DE changes.

decoupler_input <- tibble::column_to_rownames(de_table, "gene") %>%
  dplyr::select(t) %>%
  dplyr::arrange(desc(t)) %>%
  as.matrix()

We can explore how those T values are distributed using a histogram:

de_table %>%
  dplyr::arrange(desc(t)) %>%
  dplyr::mutate(gene = forcats::fct_inorder(gene)) %>%
  ggplot2::ggplot(aes(x = t)) +
  ggplot2::geom_histogram(binwidth = 10) 

The functional categories of MSigDB

In this section, we work with the prior biological knowledge that can be retrieved from the Molecular Signatures Database (MSigDB) database. This database contains functional annotations for human genes and is divided in 9 major collections. For simplicity, we only work here with data retrieved from one of them, but all the collections can retrieve interesting insights if the appropriate question is asked (e.g. genome positional gene sets). We will employ the functional categories defined by WikiPathways. WikiPathways is a manually curated database of biological pathways. It contains maps for multiple types of biological processes that range from protein-driven cellular signaling, to metabolism and genomic regulation. We first download the annotation tables using the msigdbr package:

# retrieve GO terms table
wikipathways <- msigdbr::msigdbr(species = "Homo sapiens",
                                 category = "C2",
                                 subcategory = "CP:WIKIPATHWAYS")

And have a look to its content:

head(wikipathways)
## # A tibble: 6 × 15
##   gs_cat gs_subcat       gs_name gene_symbol entrez_gene ensembl_gene human_gene_symb…
##   <chr>  <chr>           <chr>   <chr>             <int> <chr>        <chr>           
## 1 C2     CP:WIKIPATHWAYS WP_15Q… AC091304.1       645202 ENSG0000023… AC091304.1      
## 2 C2     CP:WIKIPATHWAYS WP_15Q… AC091304.1       645202 ENSG0000027… AC091304.1      
## 3 C2     CP:WIKIPATHWAYS WP_15Q… AC091304.1       645202 ENSG0000028… AC091304.1      
## 4 C2     CP:WIKIPATHWAYS WP_15Q… AC100756.2    102723623 ENSG0000027… AC100756.2      
## 5 C2     CP:WIKIPATHWAYS WP_15Q… AC100756.2    102723623 ENSG0000028… AC100756.2      
## 6 C2     CP:WIKIPATHWAYS WP_15Q… CYFIP1            23191 ENSG0000027… CYFIP1          
## # … with 8 more variables: human_entrez_gene <int>, human_ensembl_gene <chr>,
## #   gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>, gs_exact_source <chr>,
## #   gs_url <chr>, gs_description <chr>

Next, we transform them into objects compatible with decoupleR

# helper function to transform msigdb data frames into decoupler compatible data frames
msigdb_to_decoupler <- function(df) {
  
  output <- df %>%
    dplyr::transmute(
      source = gs_name,
      target = gene_symbol,
      mor = 1,
      likelihood = 1
    ) %>%
    dplyr::distinct()
  return(output)
  
}

# apply to data frame
wiki_decoupler <- msigdb_to_decoupler(wikipathways)

Over-representation Analysis (ORA)

The first method that we will employ to perform the functional analysis of our results is called over-representation analysis (ORA). In brief, we select the top N up and down regulated genes according to their T values and evaluate which functional categories are over-represented on them (as explained in the theoretical part of the course). To do so, we employ a hyper-geometric tests the null of independence of rows and columns in contingency tables that take the following form:

Genes in category Genes not in category (Background)
Altered genes 30 270
Not altered genes 70 9930

And which hence tests the following hypotheses:

  • H0: Genes belonging to the category are altered in the same proportion than the rest of genes
  • H1: Genes belonging to the category are altered a higher proportion than the rest of genes

To carry out the over-representation analysis in the top N up and down regulated genes, we use the run_ora() function from decoupleR:

# set the top N
n_top <- 50

# run for up-regulated genes using n_up
ora_up <- decoupleR::run_ora(
  mat = decoupler_input,
  network = wiki_decoupler,
  n_up = n_top,
  n_bottom = 0,
  .source = "source",
  .target = "target"
) %>%
  dplyr::mutate(adj_p = p.adjust(p_value, method = "BH"))

# run for down-regulated genes using n_bottom
ora_down <- decoupleR::run_ora(
  mat = decoupler_input,
  network = wiki_decoupler,
  n_up = 0,
  n_bottom = n_top,
  .source = "source",
  .target = "target"
) %>%
  dplyr::mutate(adj_p = p.adjust(p_value, method = "BH"))

# bind both data frames
ora_results <- list(Up = ora_up, Down = ora_down) %>%
  dplyr::bind_rows(.id = "status") %>%
  dplyr::arrange(p_value)

QUESTION: What is the p.adjust() function? Why do we apply it here?

And look to top 10 up and down regulated functional categories:

# plot them
ora_results %>%
  dplyr::group_by(status) %>%
  dplyr::slice_head(n = 10) %>%
  dplyr::mutate(source = forcats::fct_rev(forcats::fct_inorder(source))) %>%
  ggplot2::ggplot(aes(x = -log10(p_value), y = source, fill = status)) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::scale_fill_manual(values = c("Up" = "red", "Down" = "blue")) 

QUESTION: Do you think that the ORA results are significant? Why? Tip: Check the content of the ora_results object.

QUESTION: Try to perform the same analysis using a different number of top up and down regulated genes modifying the n_top variable. How do the results change?

Set Enrichment Analysis (SEA)

In the previous section, we selected a number of top up and down regulated genes to perform the ORA. The number of genes has a profound impact on the results and setting this threshold is not always easy. An alternative to ORA are the methods that belong to the category of “Functional Class Scoring (FCS)”. These type of methods became very popular after the publication of the Gene Set Enrichment Analysis (GSEA) method by Subramanian et al. 2005.

In contrast to ORA, the SEA does not require users to establish an arbitrary threshold of genes that are up or down regulated and which are worth exploring in a functional context. Instead, it uses a summary statistic for all the genes that belong to a certain functional category and assess the significance of the results, usually through the calculation of permutation-based null distributions. By doing so, it assumes that the members of a de-regulated functional category should have a more extreme position in the list of input genes when those are ranked by a metric that describes a phenotype (e.g. The T value or the logFC that we calculate in this tutorial).

We are now going to try one of these approaches employing the run_wmean() function from the decoupleR package. This function will perform a weighted mean of the T values from genes that belong to a given functional category. In this example, we only use a weight of 1 to indicate whether a gene belongs or not to a pathway. More sophisticated analyses can be carried out using both the .mor and .likelihood parameters, as we will see below. We now carry out the FCS-like analysis using decoupleR:

wmean_results <- run_wmean(
  mat = decoupler_input,
  network = wiki_decoupler,
  times = 2000,
  .source = "source",
  .target = "target",
  .mor = "mor",
  .likelihood = "likelihood"
) %>%
  subset(statistic == "norm_wmean") %>%
  dplyr::mutate(adj_p = p.adjust(p_value, method = "BH"))

And plot the top 10 up and down regulated pathways according to the results of the run_wmean() function:

# given that we are going to create several plots like this, we create a helper function
plot_fcs_results <- function(df) {
  
  p <- df %>%
    dplyr::mutate(status = ifelse(score > 0, "Up", "Down")) %>%
    dplyr::arrange(p_value) %>%
    dplyr::group_by(status) %>%
    dplyr::slice_head(n = 10) %>%
    dplyr::arrange(score) %>%
    dplyr::mutate(source = forcats::fct_inorder(source)) %>%
    ggplot2::ggplot(aes(x = score, y = source, fill = status)) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::scale_fill_manual(values = c("Up" = "red", "Down" = "blue"))
  
  return(p)
  
}

plot_fcs_results(wmean_results)

QUESTION: What do the Y axis and X axis represent in this plot?

QUESTION: Are these results significant? Why?

QUESTION: What does the times parameter control inside the run_wmean() function? Try to tune this parameter and see what happens (WARNING: Do not set it above 10000).

Footprint-based analyses

Until now, we have used the functional categories as specified in WikiPathways. We have only considered whether a gene belongs to a given pathway or not. In addition, we have used the gene expression changes as a proxy of the pathway activity changes. Unconsciously, we have made the following assumption: The gene expression measurements (as obtained from RNA-Seq) are a good proxy of the resulting gene product (proteins) activities. This is a big assumption and a huge leap in within the central dogma of biology. Some articles like Liu et al 2016 or Buccitelli et al. 2020 provide a good perspective about the topic.

A new generation of computational methods, known as “footprint-based analysis” do not make this assumption, but instead employ the omics data as a signatures of upstream biological activities.

Figure from Dugourd et al. 2019.

DoRothEA and PROGENy are prior knowledge resources which we can couple with a statistical method to extract insights about biological activities from transcriptomic data.

Transcription factor activity estimation

Transcription factors (TFs) are proteins that regulate the transciption rate of genes. When active, these proteins binds to specific DNA regions, promoting or inhibiting the expression of certain genes. Given the biological proximity of TFs to transcriptomic changes, it is reasonable to assume that we can employ this omic signature to infer the regulatory activity of upstream TFs. But for this, we need to know the (positive or negative) interactions between certain TFs and genes. DoRothEA is a prior knowledge resource that provides exactly this type of information. DoRothEA interactions define TF-gene regulatory mechanisms and were “curated and collected from different types of evidence such as literature curated resources, ChIP-seq peaks, TF binding site motifs and interactions inferred directly from gene expression”. For more information about DoRothEA, please see Garcia-Alonso et al. 2019. We can retrieve the content of DoRothEA from the R package dorothea. Lets have a look to its content.

dorothea_interactions <- dorothea::dorothea_hs %>%
  subset(confidence %in% c("A", "B", "C")) %>%
  dplyr::mutate(likelihood = 1)

head(dorothea_interactions)
## # A tibble: 6 × 5
##   tf    confidence target   mor likelihood
##   <chr> <chr>      <chr>  <dbl>      <dbl>
## 1 AHR   C          CYP1A1     1          1
## 2 AHR   C          CYP1A2     1          1
## 3 AHR   C          CYP1B1     1          1
## 4 AHR   C          FOS        1          1
## 5 AHR   C          MYC        1          1
## 6 AHR   C          UGT1A6     1          1

OPTIONAL: Explore the prior knowledge using networks and showing basic graph theory concepts like node, interaction, degree… etc

QUESTION: What does the “confidence” column represent? What defines the different levels of evidence in this resource?

QUESTION: What does the “mor” column represent? Discuss about the sign of interactions in prior knowledge and biological networks.

Similar to what we have previously done with the functional categories based on WikiPathways, we can apply the run_wmean() function from decoupler using the functional sets defined by DoRothEA TFs.

tf_activity_results <- run_wmean(
  mat = decoupler_input, 
  network =dorothea_interactions, 
  times = 2000,
  .source = "tf",
  .target = "target",
  .mor = "mor",
  .likelihood = "likelihood"
) %>%
  subset(statistic == "norm_wmean") %>%
  dplyr::mutate(adj_p = p.adjust(p_value, method = "BH"))

And visualize the top altered TFs

plot_fcs_results(tf_activity_results)

Prediciting the activity of signaling pahtways with PROGENy

While TFs are very close to transcriptomic changes, protein-driven signaling pathways are not. This is why estimating their activity from transcriptomic data is not a straightforward task. Following the footprint-based philosophy, PROGENy is a resource that address this gap. In essence, PROGENy is a resource that was built using information from > 500 experiments where one or more signaling pathways were perturbed. By analyzing those experiments, we created a resource that provides information about which transcriptomic changes are associated to certain signaling pathways perturbations. For more information about PROGENy, please see Schubert et al. 2018

Similar to the dorothea package, the progeny package provides users with the pathway-gene interactions. We can retrieve it and format it according to decoupler requirements.

# adapt progeny interactions to decoupler format
progeny_interactions <-
  progeny::getModel(organism = "Human", top = 300) %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "target") %>%
  tidyr::pivot_longer(-target, names_to = "pathway", values_to = "weight") %>%
  subset(weight != 0) %>%
  dplyr::mutate(mor = sign(weight), likelihood = abs(weight))

# show the first interactions
head(subset(progeny_interactions, weight != 0))
## # A tibble: 6 × 5
##   target pathway  weight   mor likelihood
##   <chr>  <chr>     <dbl> <dbl>      <dbl>
## 1 A2M    PI3K      -4.13    -1       4.13
## 2 AAMDC  MAPK      -1.19    -1       1.19
## 3 AASS   VEGF      -2.77    -1       2.77
## 4 ABCA1  Estrogen  -2.93    -1       2.93
## 5 ABCA12 p53        9.60     1       9.60
## 6 ABCA5  EGFR      -1.81    -1       1.81

QUESTION: What does the “likelihood” column represent? Discuss about the likelihood of interactions, or more general, about the edge weights in prior knowledge networks.

Once with this information, we can perform the pathway activity estimation using, again the functions provided on the decoupler package.

progeny_results <- run_wmean(
  mat = decoupler_input,
  network = progeny_interactions,
  times = 2000,
  .source = "pathway",
  .target = "target",
  .mor = "mor",
  .likelihood = "likelihood"
) %>%
  subset(statistic == "norm_wmean") %>%
  dplyr::mutate(adj_p = p.adjust(p_value, method = "BH"))

And visualize the results

plot_fcs_results(progeny_results)

QUESTION: How different is the output of this analysis in comparison with the previous ones? What do the Y axis represent? Are the results significant?

QUESTION: How do the results of this analysis correlate with the biology of ovarian cancer?

Take home message and preparation for 2nd day

Today we analyzed a bulk RNA-Seq dataset starting from the raw counts matrix. We performed the quality control and the pre-processing of the count matrix, and applied the TMM+CPM method to obtain the normalized gene expression measurements. We then compared the basal transcriptome of the two cell lines through a differential expression analysis. In a final step, we obtained a set of functional results that enable us to interpret and link our data with the chemoresistant cell line phenotype.

While the first sections are specific to RNA-Seq data, the functional analysis methodx that we show here are also applicable to other types of sequencing data, proteomics data and even metabolomics data. The key aspect of the functional analysis is to select a prior knowledge resource that matches our data and that serves to answer a specific question.

Tomorrow you will learn how to employ network methods to generate mechanistic hypotheses. Most of the prior knowledge resources that we employed today can be represented as networks, and as you will see tomorrow, those offer a convenient framework for the mathematical modeling of certain biological processes.

Session Information

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] edgeR_3.34.1        limma_3.48.3        dorothea_1.4.2     
##  [4] progeny_1.14.0      decoupleR_2.1.3     BiocManager_1.30.16
##  [7] msigdbr_7.4.1       forcats_0.5.1       stringr_1.4.0      
## [10] dplyr_1.0.7         purrr_0.3.4         readr_2.0.2        
## [13] tidyr_1.1.4         tibble_3.1.5        ggplot2_3.3.5      
## [16] tidyverse_1.3.1    
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7                matrixStats_0.61.0         
##  [3] fs_1.5.0                    bit64_4.0.5                
##  [5] lubridate_1.8.0             httr_1.4.2                 
##  [7] GenomeInfoDb_1.28.4         tools_4.1.2                
##  [9] backports_1.2.1             bslib_0.3.1                
## [11] utf8_1.2.2                  R6_2.5.1                   
## [13] rpart_4.1-15                DBI_1.1.1                  
## [15] BiocGenerics_0.38.0         colorspace_2.0-2           
## [17] withr_2.4.2                 gridExtra_2.3              
## [19] tidyselect_1.1.1            bit_4.0.4                  
## [21] compiler_4.1.2              cli_3.0.1                  
## [23] rvest_1.0.1                 Biobase_2.52.0             
## [25] xml2_1.3.2                  DelayedArray_0.18.0        
## [27] labeling_0.4.2              bookdown_0.24              
## [29] sass_0.4.0                  scales_1.1.1               
## [31] speedglm_0.3-3              digest_0.6.28              
## [33] rmarkdown_2.11              XVector_0.32.0             
## [35] pkgconfig_2.0.3             htmltools_0.5.2            
## [37] bcellViper_1.28.0           MatrixGenerics_1.4.3       
## [39] highr_0.9                   dbplyr_2.1.1               
## [41] fastmap_1.1.0               rlang_0.4.11               
## [43] readxl_1.3.1                rstudioapi_0.13            
## [45] farver_2.1.0                jquerylib_0.1.4            
## [47] generics_0.1.0              jsonlite_1.7.2             
## [49] vroom_1.5.5                 RCurl_1.98-1.5             
## [51] magrittr_2.0.1              GenomeInfoDbData_1.2.6     
## [53] Matrix_1.3-4                Rcpp_1.0.7                 
## [55] munsell_0.5.0               S4Vectors_0.30.2           
## [57] fansi_0.5.0                 babelgene_21.4             
## [59] lifecycle_1.0.1             stringi_1.7.5              
## [61] yaml_2.2.1                  MASS_7.3-54                
## [63] SummarizedExperiment_1.22.0 zlibbioc_1.38.0            
## [65] grid_4.1.2                  ggrepel_0.9.1              
## [67] parallel_4.1.2              crayon_1.4.1               
## [69] lattice_0.20-45             haven_2.4.3                
## [71] hms_1.1.1                   locfit_1.5-9.4             
## [73] knitr_1.36                  pillar_1.6.3               
## [75] ranger_0.13.1               GenomicRanges_1.44.0       
## [77] stats4_4.1.2                reprex_2.0.1               
## [79] glue_1.4.2                  evaluate_0.14              
## [81] modelr_0.1.8                vctrs_0.3.8                
## [83] rmdformats_1.0.3            tzdb_0.1.2                 
## [85] cellranger_1.1.0            gtable_0.3.0               
## [87] assertthat_0.2.1            xfun_0.26                  
## [89] broom_0.7.9                 RobustRankAggreg_1.1       
## [91] IRanges_2.26.0              ellipsis_0.3.2