Sample Metadata Analysis

Tissue metabolomics experiment is a standard metabolomics experiment using tissue samples (e.g. from animals or patients).

In this tutorial we showcase how to use MetaProViz:

to perform differential metabolite analysis (dma) to generate Log2FC and statistics and perform pathway analysis using Over Representation Analysis (ORA) on the results.
to do metabolite clustering analysis (MCA) to find clusters of metabolites with similar behaviors based on patients demographics like age, gender and tumour stage.
Find the main metabolite drivers that separate patients based on their demographics like age, gender and tumour stage.

First if you have not done yet, install the required dependencies and load the libraries:

# 1. Install Rtools if you haven’t done this yet, using the appropriate version (e.g.windows or macOS).
# 2. Install the latest development version from GitHub using devtools
# devtools::install_github("https://github.com/saezlab/MetaProViz")

library(MetaProViz)

# dependencies that need to be loaded:
library(magrittr)
library(dplyr)
library(rlang)
library(tidyr)
library(tibble)
library(stringr)

# Please install the Biocmanager Dependencies:
# BiocManager::install("clusterProfiler")
# BiocManager::install("EnhancedVolcano")

1. Loading the example data

Here we choose an example datasets, which is publicly available in the paper “An Integrated Metabolic Atlas of Clear Cell Renal Cell Carcinoma”, which includes metabolomic profiling on 138 matched clear cell renal cell carcinoma (ccRCC)/normal tissue pairs (Hakimi et al. 2016). Metabolomics was done using The company Metabolon, so this is untargeted metabolomics. Here we use the median normalised data from the supplementary table 2 of the paper. We have combined the metainformation about the patients with the metabolite measurements and removed unidentified metabolites. Lastly, we have added a column “Stage” where Stage1 and Stage2 patients are summarised to “EARLY-STAGE” and Stage3 and Stage4 patients to “LATE-STAGE”. Moreover, we have added a column “Age”, where patients with “AGE AT SURGERY” <42 are defined as “Young” and patients with AGE AT SURGERY >58 as “Old” and the remaining patients as “Middle”.

#As part of the MetaProViz package you can load the example data into your global environment using the function toy_data():
1. Tissue experiment (Intra)
We can access the built-in dataset tissue_norm, which includes columns with Sample information and columns with the median normalised measured metabolite integrated peaks.

# Load the example data:
Tissue_Norm <- tissue_norm%>%
  column_to_rownames("Code")

Preview of the DF `Tissue_Norm` including columns with sample information and metabolite ids with their measured values.
	TISSUE_TYPE	GENDER	AGE_AT_SURGERY	TYPE-STAGE	STAGE	AGE	1,2-propanediol	1,3-dihydroxyacetone
DIAG-16076	TUMOR	Male	74.7778	TUMOR-STAGE I	EARLY-STAGE	Old	0.710920	0.809182
DIAG-16077	NORMAL	Male	74.7778	NORMAL-STAGE I	EARLY-STAGE	Old	0.339390	0.718725
DIAG-16078	TUMOR	Male	77.1778	TUMOR-STAGE I	EARLY-STAGE	Old	0.413386	0.276412
DIAG-16079	NORMAL	Male	77.1778	NORMAL-STAGE I	EARLY-STAGE	Old	1.595697	4.332451
DIAG-16080	TUMOR	Female	59.0889	TUMOR-STAGE I	EARLY-STAGE	Old	0.573787	0.646791

2. Additional information mapping the trivial metabolite names to KEGG IDs, HMDB IDs, etc. and selected pathways (MappingInfo)

Tissue_MetaData <- tissue_meta%>%
  column_to_rownames("Metabolite")

Preview of the DF `Tissue_MetaData` including the trivial metabolite identifiers used in the experiment as well as IDs and pathway information.
	SUPER_PATHWAY	SUB_PATHWAY	CAS	PUBCHEM	KEGG	Group_HMDB
1,2-propanediol	Lipid	Ketone bodies	57-55-6;	NA	C00583	HMDB01881
1,3-dihydroxyacetone	Carbohydrate	Glycolysis, gluconeogenesis, pyruvate metabolism	62147-49-3;	670	C00184	HMDB01882
1,5-anhydroglucitol (1,5-AG)	Carbohydrate	Glycolysis, gluconeogenesis, pyruvate metabolism	154-58-5;	NA	C07326	HMDB02712
10-heptadecenoate (17:1n7)	Lipid	Long chain fatty acid	29743-97-3;	5312435	NA	NA
10-nonadecenoate (19:1n9)	Lipid	Long chain fatty acid	73033-09-7;	5312513	NA	NA

2. Run MetaProViz Analysis

Pre-processing

This has been done by the authors of the paper and we will use the median normalized data. If you want to know how you can use the MetaProViz pre-processing module, please check out the vignette:
- Standard metabolomics data
- Consumption-Release (core) metabolomics data from cell culture media

Metadata analysis

We can use the patient’s metadata to find the main metabolite drivers that separate patients based on their demographics like age, gender, etc.

Here the metadata analysis is based on principal component analysis (PCA), which is a dimensionality reduction method that reduces all the measured features (=metabolites) of one sample into a few features in the different principal components, whereby each principal component can explain a certain percentage of the variance between the different samples. Hence, this enables interpretation of sample clustering based on the measured features (=metabolites).
The MetaProViz::metadata_analysis() function will perform PCA to extract the different PCs followed by annova to find the main metabolite drivers that separate patients based on their demographics.

MetaRes <- MetaProViz::metadata_analysis(data=Tissue_Norm[,-c(1:13)],
                                     metadata_sample= Tissue_Norm[,c(2,4:5,12:13)],
                                     scaling = TRUE,
                                     percentage = 0.1,
                                     cutoff_stat= 0.05,
                                     cutoff_variance = 1)
#> The column names of the 'metadata_sample' contain special character that where removed.

Ultimately, this is leading to clusters of metabolites that are driving the separation of the different demographics.

We generated the general anova output DF:

Preview of the DF MetaRes[[`res_aov`]] including the main metabolite drivers that separate patients based on their demographics.
	PC	tukeyHSD_Contrast	term	anova_sumsq	anova_meansq	anova_statistic	anova_p.value	tukeyHSD_p.adjusted	Explained_Variance
1	PC1	TUMOR-NORMAL	TISSUE_TYPE	7375.2212254	7375.2212254	90.1352800	0.0000000	0.0000000	19.0079573
2	PC1	Black-Asian	RACE	156.7549451	52.2516484	0.4795311	0.6967838	0.9992824	19.0079573
1777	PC232	White-Asian	RACE	0.2726952	0.0908984	0.9544580	0.4147472	0.5495634	0.0166997
1778	PC232	LATE-STAGE-EARLY-STAGE	STAGE	0.0360319	0.0360319	0.3776760	0.5393596	0.5393596	0.0166997
3191	PC9	White-Other	RACE	20.2626222	6.7542074	0.7090398	0.5473277	0.8897908	1.6658973
3192	PC9	Young-Middle	AGE	49.2030973	24.6015486	2.6213834	0.0745317	0.9979475	1.6658973

We generated the summarised results output DF, where each feature (=metabolite) was assigned a main demographics parameter this feature is separating:

Preview of the DF MetaRes[[`res_summary`]] including the metabolite drivers in rows and list the patients demographics they can separate.
feature	term	Sum(Explained_Variance)	MainDriver	MainDriver_Term	MainDriver_Sum(VarianceExplained)
N2-methylguanosine	AGE, GENDER, RACE, STAGE, TISSUE_TYPE	3.7598809871366, 2.75344828467363, 1.43932034747081, 25.2803273612931, 33.9484968841434	FALSE, FALSE, FALSE, FALSE, TRUE	TISSUE_TYPE	33.9485
5-methyltetrahydrofolate (5MeTHF)	AGE, GENDER, RACE, STAGE, TISSUE_TYPE	0.35114340819656, 0.29468807988021, 0.252515004077172, 19.4058160726611, 32.5442969233501	FALSE, FALSE, FALSE, FALSE, TRUE	TISSUE_TYPE	32.5443
N-acetylalanine	AGE, GENDER, RACE, STAGE, TISSUE_TYPE	0.381811888038144, 1.82507161869685, 2.97460134435356, 19.0079573465896, 32.5442969233501	FALSE, FALSE, FALSE, FALSE, TRUE	TISSUE_TYPE	32.5443
N-acetyl-aspartyl-glutamate (NAAG)	AGE, GENDER, RACE, STAGE, TISSUE_TYPE	0.212976478603115, 2.75344828467363, 0.235952044704099, 20.3572056704699, 32.2825995362096	FALSE, FALSE, FALSE, FALSE, TRUE	TISSUE_TYPE	32.2826
1-heptadecanoylglycerophosphoethanolamine*	AGE, GENDER, RACE, STAGE, TISSUE_TYPE	4.33031140898825, 0.267851018697989, 0.437853800763464, 22.8424358182214, 30.8783995754163	FALSE, FALSE, FALSE, FALSE, TRUE	TISSUE_TYPE	30.8784
1-linoleoylglycerophosphoethanolamine*	AGE, STAGE, TISSUE_TYPE	4.2011500430443, 22.7555733683955, 30.8783995754163	FALSE, FALSE, TRUE	TISSUE_TYPE	30.8784

##1. Tissue_Type
TissueTypeList <- MetaRes[["res_summary"]]%>%
  filter(MainDriver_Term == "TISSUE_TYPE")%>%
  filter(`MainDriver_Sum(VarianceExplained)`>30)%>%
  select(feature)%>%
  pull()

# select columns tissue_norm that are in TissueTypeList if they exist
Input_Heatmap <- Tissue_Norm[ , names(Tissue_Norm) %in% TissueTypeList]#c("N1-methylguanosine", "N-acetylalanine", "lysylmethionine")

# Heatmap: Metabolites that separate the demographics, like here TISSUE_TYPE
MetaProViz:::viz_heatmap(data = Input_Heatmap,
                       metadata_sample = Tissue_Norm[,c(1:13)],
                       metadata_info = c(color_Sample = list("TISSUE_TYPE")),
                       scale ="column",
                      plot_name = "MainDrivers")

DMA

Here we use Differential Metabolite Analysis (dma) to compare two conditions (e.g. Tumour versus Healthy) by calculating the Log2FC, p-value, adjusted p-value and t-value.
For more information please see the vignette:
- Standard metabolomics data
- Consumption-Release (core) metabolomics data from cell culture media

We will perform multiple comparisons based on the different patient demographics available: 1. Tumour versus Normal: All patients 2. Tumour versus Normal: Subset of Early Stage patients 3. Tumour versus Normal: Subset of Late Stage patients 4. Tumour versus Normal: Subset of Young patients 5. Tumour versus Normal: Subset of Old patients

# Prepare the different selections
EarlyStage <- Tissue_Norm %>%
  filter(STAGE== "EARLY-STAGE")
LateStage <- Tissue_Norm %>%
  filter(STAGE=="LATE-STAGE")
Old <- Tissue_Norm %>%
  filter(AGE=="Old")
Young <- Tissue_Norm%>%
  filter(AGE=="Young")

DFs <- list("TissueType"= Tissue_Norm,"EarlyStage"= EarlyStage, "LateStage"= LateStage, "Old"= Old, "Young"=Young)

# Run dma
ResList <- list()
for(item in names(DFs)){
  #Get the right DF:
  data <- DFs[[item]]

  message(paste("Running dma for", item))
  #Create folder for saving each comparison
  dir.create(paste(getwd(),"/MetaProViz_Results/dma/", sep=""), showWarnings = FALSE)
  dir.create(paste(getwd(),"/MetaProViz_Results/dma/", item, sep=""), showWarnings = FALSE)

  #Perform dma
  TvN <- MetaProViz::dma(data =  data[,-c(1:13)],
                   metadata_sample =  data[,c(1:13)],
                   metadata_info = c(Conditions="TISSUE_TYPE", Numerator="TUMOR" , Denominator = "NORMAL"),
                   shapiro=FALSE, #The data have been normalized by the company that provided the results and include metabolites with zero variance as they were all imputed with the same missing value.
                   path = paste(getwd(),"/MetaProViz_Results/dma/", item, sep=""))

  #Add Results to list
  ResList[[item]] <- TvN
}

#> Running dma for TissueType
#> There are no NA/0 values

#> Running dma for EarlyStage
#> There are no NA/0 values

#> Running dma for LateStage
#> There are no NA/0 values

#> Running dma for Old
#> There are no NA/0 values

#> Running dma for Young
#> There are no NA/0 values

We can see from the different Volcano plots have smaller p.adjusted values and differences in Log2FC range.
Here we can also use the MetaproViz::viz_volcano() function to plot comparisons together on the same plot, such as Tumour versus Normal of young and old patients:

# Early versus Late Stage
MetaProViz::viz_volcano(plot_types="Compare",
                       data=ResList[["EarlyStage"]][["dma"]][["TUMOR_vs_NORMAL"]]%>%tibble::column_to_rownames("Metabolite"),
                       data2= ResList[["LateStage"]][["dma"]][["TUMOR_vs_NORMAL"]]%>%tibble::column_to_rownames("Metabolite"),
                       name_comparison= c(data="EarlyStage", data2= "LateStage"),
                       plot_name= "EarlyStage-TUMOR_vs_NORMAL compared to LateStage-TUMOR_vs_NORMAL",
                       subtitle= "Results of dma" )


# Young versus Old
MetaProViz::viz_volcano(plot_types="Compare",
                       data=ResList[["Young"]][["dma"]][["TUMOR_vs_NORMAL"]]%>%tibble::column_to_rownames("Metabolite"),
                       data2= ResList[["Old"]][["dma"]][["TUMOR_vs_NORMAL"]]%>%tibble::column_to_rownames("Metabolite"),
                       name_comparison= c(data="Young", data2= "Old"),
                       plot_name= "Young-TUMOR_vs_NORMAL compared to Old-TUMOR_vs_NORMAL",
                       subtitle= "Results of dma" )

Here we can observe that Tumour versus Normal has lower significance values for the Young patients compared to the Old patients. This can be due to higher variance in the metabolite measurements from Young patients compared to the Old patients.
We can also check if the top changed metabolites comparing Tumour versus Normal correlate with the main metabolite drivers that separate patients based on their TISSUE_TYPE, which are Tumour or Normal.

# Get the top changed metabolites
top_entries <- ResList[["TissueType"]][["dma"]][["TUMOR_vs_NORMAL"]] %>%
  arrange(desc(t.val)) %>%
  slice(1:25)%>%
  select(Metabolite)%>%
  pull()
bottom_entries <- ResList[["TissueType"]][["dma"]][["TUMOR_vs_NORMAL"]] %>%
  arrange(desc(t.val)) %>%
  slice((n()-24):n())%>%
  select(Metabolite)  %>%
  pull()

# Check if those overlap with the top demographics drivers
ggVennDiagram::ggVennDiagram(list(top = top_entries,
                                  Bottom = bottom_entries,
                                  TissueTypeList = TissueTypeList))+
  ggplot2::scale_fill_gradient(low = "blue", high = "red")



MetaData_Metab <- merge(x=tissue_meta,
                   y= MetaRes[["res_summary"]][, c(1,5:6) ]%>%tibble::column_to_rownames("feature"),
                   by=0,
                   all.y=TRUE)%>%
  column_to_rownames("Row.names")

# Make a Volcano plot:
MetaProViz::viz_volcano(plot_types="Standard",
                       data=ResList[["TissueType"]][["dma"]][["TUMOR_vs_NORMAL"]]%>%tibble::column_to_rownames("Metabolite"),
                       metadata_feature =  MetaData_Metab,
                       metadata_info = c(color = "MainDriver_Term"),
                      plot_name= "TISSUE_TYPE-TUMOR_vs_NORMAL",
                       subtitle= "Results of dma" )

Metabolite ID QC

Before we perform enrichment analysis,it is important to check the availability of metabolite IDs that we aim to use. We can first create an overview plot:

#Load the feature metadata:
MetaboliteIDs <- Tissue_MetaData %>%
  rownames_to_column("Metabolite")%>%
  dplyr::rename("HMDB"= "Group_HMDB")%>%
  slice(1:577)# only keep entrys with trivial names

ccRCC_CompareIDs <- MetaProViz::compare_pk(data = list(Biocft = MetaboliteIDs |> dplyr::rename("Class"="SUPER_PATHWAY")),
                                           name_col = "Metabolite",
                                           metadata_info = list(Biocft = c("KEGG", "HMDB", "PUBCHEM")),
                                           plot_name = "Overlap of ID types in ccRCC data")

Here we notice that 76 features have no metabolite ID assigned, yet have a trivial name and a metabolite class assigned. For 135 metabolites we only have a pubchem ID, yet no HMDB or KEGG ID. Next, we will try to understand the missing IDs focusing on HMDb as an example:

Plot1_HMDB <- MetaProViz::count_id(MetaboliteIDs, "HMDB")

Over 200 metabolites have no HMDB ID and if there is a HMDB ID assigned, we only have one HMDB ID. Here we assume the experimental setup does not account for stereoisomers and that other IDs are with different degrees of ambiguity are possible (e.g. L-Alanine HMDB ID is present and we will add D-Alanine HMDB ID, see PK vignette for more details). This enables us to assign additional potential HMDB IDs per feature.

Input_HMDB <- MetaboliteIDs %>%
  dplyr::filter(!is.na(HMDB)) %>% # ID in the measured data we want to use, hence we remove NA's
  dplyr::select("Metabolite", "HMDB", "SUPER_PATHWAY") # only keep relevant columns

# Add equivalent IDs:
tissue_meta_AddIDs_HMDB <- MetaProViz::equivalent_id(data= Input_HMDB,
                                                           metadata_info = c(InputID="HMDB"),# ID in the measured data, here we use the HMDB ID
                                                           from = "hmdb")
#> Warning in MetaProViz::equivalent_id(data = Input_HMDB, metadata_info =
#> c(InputID = "HMDB"), : The following IDs are duplicated and removed: HMDB01859
#> chebi is used to find additional potential IDs for hmdb.
# Let's see how this has changed the number of entries:
tissue_meta_AddIDs_HMDB <- merge(x=MetaboliteIDs,
                                       y=tissue_meta_AddIDs_HMDB,
                                       by="HMDB",
                                       all.x=TRUE)

# Plot 2 after equivalent IDs where added
Plot2_HMDB <- MetaProViz::count_id(tissue_meta_AddIDs_HMDB, "AllIDs")

Additionally, for the metabolites without HMDB ID we can check if a HMDB ID is available using the other ID types. First we check if we have cases in which we do have no HMDB ID, but other available IDs:

Overview of other ID types for metabolites without HMDB ID.
Total_Cases	Not_NA_PUBCHEM	Not_NA_KEGG	Both_Not_NA
160	159	25	24

no_hmdb <- dplyr::filter(MetaboliteIDs, is.na(HMDB) & (!is.na(PUBCHEM) | !is.na(KEGG)))

tissue_meta_translated_HMDB <-
  MetaProViz::translate_id(no_hmdb, metadata_info = list(InputID = "PUBCHEM", grouping_variable = "SUPER_PATHWAY"), from = "pubchem", to = "hmdb") |>
  extract2("TranslatedDF") |>
  rename(hmdb_from_pubchem = hmdb) |>
  MetaProViz::translate_id(metadata_info = list(InputID = "KEGG", grouping_variable = "SUPER_PATHWAY"), from = "kegg", to = "hmdb") |>
  extract2("TranslatedDF") |>
  rename(hmdb_from_kegg = hmdb)
#> Warning in MetaProViz::translate_id(no_hmdb, metadata_info = list(InputID =
#> "PUBCHEM", : The following IDs are duplicated within one group: 5.24601e+13,
#> 6.99312e+13

# Here we combine the tables above, created by equivalent_id and translate_id:
Tissue_MetaData_HMDB <-
  left_join(
    tissue_meta_AddIDs_HMDB |>
      select(Metabolite = Metabolite.x, SUPER_PATHWAY = SUPER_PATHWAY.x, SUB_PATHWAY ,COMP_ID, PLATFORM, RI, MASS, CAS, PUBCHEM, KEGG,  HMDB, hmdb_from_equivalentid = AllIDs),
    tissue_meta_translated_HMDB |>
      select(COMP_ID, hmdb_from_pubchem, hmdb_from_kegg) |> mutate(across(starts_with("hmdb_from"), ~na_if(., ""))),
    by = 'COMP_ID'
  ) |>
  rowwise() |>
  mutate(hmdb_combined = list(unique(na.omit(unlist(stringr::str_split(across(starts_with("hmdb_from")), ",")))))) |>
  mutate(hmdb_combined = paste0(hmdb_combined, collapse = ',')) |>  # we concatenate by "," only for the sake of printing in notebook
  ungroup()

# Plot 3:
Plot3_HMDB <- MetaProViz::count_id(Tissue_MetaData_HMDB, "hmdb_combined")

If we inspect the tables we can check that for example, 16-hydroxypalmitate certainly has an HMDB ID, yet was missing from the original metadata table. We can check the structures associated to it on PubChem:10466 and KEGG:C18218 are indeed identical, and as we expect, the same structure is in HMDB too, under the ID HMDB0006294.

We can of course do the same for other IDs, such as the KEGG IDs:

Plot1_KEGG <- MetaProViz::count_id(MetaboliteIDs, "KEGG")


##################################################################################################################
Input_KEGG <- Tissue_MetaData_HMDB %>%
  dplyr::filter(!is.na(KEGG)) %>% # ID in the measured data we want to use, hence we remove NA's
  dplyr::select("Metabolite", "KEGG", "SUPER_PATHWAY") # only keep relevant columns

# Add equivalent IDs:
tissue_meta_AddIDs_KEGG <- MetaProViz::equivalent_id(data= Input_KEGG,
                                                metadata_info = c(InputID="KEGG"),# ID in the measured data, here we use the HMDB ID
                                                from = "kegg")
#> Warning in MetaProViz::equivalent_id(data = Input_KEGG, metadata_info =
#> c(InputID = "KEGG"), : The following IDs are duplicated and removed: C01585,
#> C00474, C06804, C03594, C06429
#> chebi is used to find additional potential IDs for kegg.

# Let's see how this has changed the number of entries:
tissue_meta_AddIDs_KEGG <- merge(x=Tissue_MetaData_HMDB,
                                 y=tissue_meta_AddIDs_KEGG,
                                 by="KEGG",
                                 all.x=TRUE)

Plot2_KEGG <-  MetaProViz::count_id(tissue_meta_AddIDs_KEGG, "AllIDs")


###################################################################################################################
no_KEGG <- dplyr::filter(Tissue_MetaData_HMDB, is.na(KEGG) & (!is.na(PUBCHEM) | !is.na(HMDB)))

tissue_meta_translated_KEGG <-
  MetaProViz::translate_id(no_KEGG, metadata_info = list(InputID = "PUBCHEM", grouping_variable = "SUPER_PATHWAY"), from = "pubchem", to = "kegg") |>
  extract2("TranslatedDF") |>
  rename(kegg_from_pubchem = kegg) |>
  MetaProViz::translate_id(metadata_info = list(InputID = "HMDB", grouping_variable = "SUPER_PATHWAY"), from = "hmdb", to = "kegg") |>
  extract2("TranslatedDF") |>
  rename(kegg_from_hmdb = kegg)
#> Warning in MetaProViz::translate_id(no_KEGG, metadata_info = list(InputID =
#> "PUBCHEM", : The following IDs are duplicated within one group: 5.24601e+13,
#> 6.99312e+13

# here we combine the tables above, created by equivalent_id and translate_id:
Tissue_MetaData_KEGG <-
  left_join(
    tissue_meta_AddIDs_KEGG |>
      select(Metabolite = Metabolite.x, SUPER_PATHWAY = SUPER_PATHWAY.x,SUB_PATHWAY, COMP_ID, PLATFORM, RI, MASS, CAS, PUBCHEM, HMDB=hmdb_combined , HMDB_Original = HMDB, hmdb_from_equivalentid, hmdb_from_kegg, hmdb_from_pubchem, , KEGG_Original = KEGG, kegg_from_equivalentid = AllIDs),
    tissue_meta_translated_KEGG |>
      select(COMP_ID, kegg_from_pubchem, kegg_from_hmdb) |> mutate(across(starts_with("kegg_from"), ~na_if(., ""))),
    by = 'COMP_ID'
  ) |>
  rowwise() |>
  mutate(KEGG= list(unique(na.omit(unlist(stringr::str_split(across(starts_with("kegg_from")), ",")))))) |>
  mutate(KEGG = paste0(KEGG, collapse = ',')) |>  # we concatenate by "," only for the sake of printing in notebook
  ungroup()

# Lets see the count now:
Plot3_KEGG <- MetaProViz::count_id(Tissue_MetaData_KEGG, "KEGG")

Lastly, we will have the new metadata table, which we can use for enrichment analysis:

Tissue_MetaData_Extended <-  Tissue_MetaData_KEGG[, c(1:10,19, 11:18)]%>%
  mutate(
    KEGG = ifelse(KEGG == "", NA, KEGG),
    HMDB = ifelse(HMDB == "", NA, HMDB)
  )

ccRCC_CompareIDs_Extended <- MetaProViz::compare_pk(data = list(Biocft = Tissue_MetaData_Extended |> dplyr::rename("Class"="SUPER_PATHWAY")),
                                                    name_col = "Metabolite",
                                                    metadata_info = list(Biocft = c("KEGG", "HMDB", "PUBCHEM")),
                                                    plot_name = "Overlap of ID types in ccRCC data")

If we compare the results to the upset plot from the original metadata we can see that initially 251 metabolites had a pubchem, HMDB and KEGG ID, whilst after adding the equivalent IDs and translating the IDs, we now have 285 metabolites with all three IDs. 135 metabolite had only a Pubchem ID, whilst now we have found HMDB and/or KEGG IDs for half of these metabolites with only 65 metabolites remaining with only a Pubchem ID.

ORA

We can perform Over Representation Analysis (ORA) using KEGG pathways for each comparison and plot significant pathways. Noteworthy, since not all metabolites have KEGG IDs, we will lose information.

Given that in some cases we have multiple KEGG IDs for a measured feature, we will check if this causes mapping to multiple, different entries in the KEGG pathway-metabolite sets:

#Load Kegg pathways:
KEGG_Pathways <- metsigdb_kegg()

#check mapping with metadata

ccRCC_to_KEGGPathways <- MetaProViz::checkmatch_pk_to_data(data = Tissue_MetaData_Extended,
                                                           input_pk = KEGG_Pathways,
                                                           metadata_info = c(InputID = "KEGG", PriorID = "MetaboliteID", grouping_variable = "term"))
#> Warning in MetaProViz::checkmatch_pk_to_data(data = Tissue_MetaData_Extended, :
#> 251 NA values were removed from column KEGG
#> Warning in MetaProViz::checkmatch_pk_to_data(data = Tissue_MetaData_Extended, :
#> 8 duplicated IDs were removed from column KEGG
#> data has multiple IDs per measurement = TRUE. input_pk has multiple IDs per entry = FALSE.
#> data has 318 unique entries with 334 unique KEGG IDs. Of those IDs, 250 match, which is 74.8502994011976%.
#> input_pk has 6511 unique entries with 6511 unique MetaboliteID IDs. Of those IDs, 250 are detected in the data, which is 3.83965596682537%.
#> Warning in MetaProViz::checkmatch_pk_to_data(data = Tissue_MetaData_Extended, :
#> There are cases where multiple detected IDs match to multiple prior knowledge
#> IDs of the same category

problems_terms <- ccRCC_to_KEGGPathways[["GroupingVariable_summary"]]%>%
  filter(!Group_Conflict_Notes== "None")

Terms in KEGG pathways where the same measured feature mapps with more than one ID, which will inflate the enrichment analysis.
KEGG	original_count	matches_count	MetaboliteID	term
C00221, C00031	2	2	C00221	Metabolic pathways
C00221, C00031	2	2	C00221	Pentose phosphate pathway
C00221, C00031	2	2	C00221	Biosynthesis of secondary metabolites
C00221, C00031	2	2	C00031	Glycolysis / Gluconeogenesis
C00221, C00031	2	2	C00031	Pentose phosphate pathway
C00221, C00031	2	2	C00031	Metabolic pathways
C00221, C00031	2	2	C00031	Biosynthesis of secondary metabolites
C00221, C00031	2	2	C00221	Glycolysis / Gluconeogenesis
C03460, C03722	2	2	C03460	Metabolic pathways
C03460, C03722	2	2	C03722	Metabolic pathways
C17737, C00695	2	2	C17737	Secondary bile acid biosynthesis
C17737, C00695	2	2	C00695	Secondary bile acid biosynthesis

Dependent on the biological question and the organism and prior knowledge, one can either maintain the metabolite ID of the more likely metabolite (e.g. in human its moire likely that we have L-aminoacid than D-aminoacid) or the metabolite ID that is represented in more/less pathways (specificity). If we are looking into the cases where we do have multiple IDs, for cases with no match to the prior knowledge we can just maintain one ID, whilst for cases with exactly one match we should maintain the ID that is found in the prior knowledge.

# Select the cases where a feature has multiple IDs
multipleIDs <- ccRCC_to_KEGGPathways[["data_summary"]]%>%
  filter(original_count>1)

Terms in KEGG pathways where the same measured feature mapps with more than one ID, which may inflate the enrichment analysis.
KEGG	InputID_select	original_count	matches_count	matches	Group_Conflict_Notes	ActionRequired	Action_Specific
C00065, C00716	NA	2	2	C00065, C00716	None	Check	KeepEachID
C00155, C05330	C00155	2	1	C00155	None	None	None
C00221, C00031	NA	2	2	C00221, C00031	None \|\| Metabolic pathways \|\| Pentose phosphate pathway \|\| Biosynthesis of secondary metabolites \|\| Glycolysis / Gluconeogenesis	Check	KeepOneID
C00671, C06008	C00671	2	1	C00671	None	None	None
C01835, C00721, C00420	NA	3	2	C01835, C00721	None	Check	KeepEachID
C01991, C00989	C00989	2	1	C00989	None	None	None
C02052, C00464	C00464	2	1	C00464	None	None	None
C03460, C03722	NA	2	2	C03460, C03722	None \|\| Metabolic pathways	Check	KeepOneID
C07202,D02358	C07202	2	0	NA	None	None	None
C17737, C00695	NA	2	2	C17737, C00695	None \|\| Secondary bile acid biosynthesis	Check	KeepOneID
D00095,C00788,D02149	C00788	3	1	C00788	None	None	None
D00567,C07589	D00567	2	0	NA	None	None	None
D00887,C06834	D00887	2	0	NA	None	None	None
D02362,C07415	D02362	2	0	NA	None	None	None

In case of ActionRequired=="Check", we can look into the column Action_Specific which contains additional information. In case of the entry KeepEachID, multiple matches to the prior knowledge were found, yet the features are in different pathways (=GroupingVariable). Yet, in case of KeepOneID, the different IDs map to the same pathway in the prior knowledge for at least one case and therefore keeping both would inflate the enrichment analysis.

SelectedIDs <- ccRCC_to_KEGGPathways[["data_summary"]]%>%
  #Expand rows where Action == KeepEachID by splitting `matches`
  dplyr::mutate(matches_split = if_else(Action_Specific == "KeepEachID", matches, NA_character_)) %>%
  separate_rows(matches_split, sep = ",\\s*") %>%
  mutate(InputID_select = if_else(Action_Specific  == "KeepEachID", matches_split, InputID_select)) %>%
  select(-matches_split) %>%
  #Select one ID for AcionSpecific==KeepOneID
  dplyr::mutate(InputID_select = case_when(
    Action_Specific == "KeepOneID" & matches == "C03460, C03722" ~ "C03722", # 2-Methylprop-2-enoyl-CoA versus Quinolinate. No evidence, hence we keep the one present in more pathways ( C03722=7 pathways, C03460=2 pathway)
    Action_Specific == "KeepOneID" & matches ==  "C00221, C00031" ~ "C00031", # These are D- and L-Glucose. We have human samples, so in this conflict we will maintain L-Glucose
    Action_Specific == "KeepOneID" & matches ==  "C17737, C00695" ~ "C00695", # Allocholic acid versus Cholic acid. No evidence, hence we keep the one present in more pathways (C00695 = 4 pathways, C17737 = 1 pathway)
    Action_Specific == "KeepOneID" ~ InputID_select,  # Keep NA where not matched manually
    TRUE ~ InputID_select
  ))

Lastly, we need to add the column including our selected IDs to the metadata table:

Tissue_MetaData_Extended <- merge(x= SelectedIDs %>%
                                        dplyr::select(KEGG, InputID_select),
                                  y= Tissue_MetaData_Extended,
                                  by= "KEGG",
                                  all.y=TRUE)

For results with p.adjusted value < 0.1 and a minimum of 10% of the pathway detected will be visualized as Volcano plots:

# Since we have performed multiple comparisons (all_vs_HK2), we will run ORA for each of this comparison
DM_ORA_res<- list()

for(comparison in names(ResList)){#Res list includes the different comparisons we performed above <-
  #Ensure that the Metabolite names match with KEGG IDs or KEGG trivial names.
  dma_res <- merge(x= Tissue_MetaData_Extended,
                   y= ResList[[comparison]][["dma"]][["TUMOR_vs_NORMAL"]], 
                   by="Metabolite", 
                   all=TRUE)
  
  #Ensure unique IDs and full background --> we include measured features that do not have a KEGG ID.
  dma_res <- dma_res[,c(3,21:25)]%>%
    dplyr::mutate(InputID_select = if_else(
      is.na(InputID_select),
      paste0("NA_", cumsum(is.na(InputID_select))),
      InputID_select
      ))%>% #remove duplications and keep the higher Log2FC measurement
    group_by(InputID_select) %>%
    slice_max(order_by = Log2FC, n = 1, with_ties = FALSE) %>%
    ungroup()%>%
    remove_rownames()%>%
    tibble::column_to_rownames("InputID_select")
  
  #Perform ORA
  Res <- MetaProViz::standard_ora(data= dma_res, #Input data requirements: column `t.val` and column `Metabolite`
                                 metadata_info=c(pvalColumn="p.adj", percentageColumn="t.val", PathwayTerm= "term", PathwayFeature= "MetaboliteID"),
                                 input_pathway=KEGG_Pathways,#Pathway file requirements: column `term`, `Metabolite` and `Description`. Above we loaded the Kegg_Pathways using MetaProViz::Load_KEGG()
                                 pathway_name=paste0("KEGG_", comparison, sep=""),
                                 min_gssize=3,
                                 max_gssize=1000,
                                 cutoff_stat=0.01,
                                 cutoff_percentage=10)

  DM_ORA_res[[comparison]] <- Res

  #Select to plot:
  Res_Select <- Res[["ClusterGosummary"]]%>%
    filter(p.adjust<0.1)%>%
    #filter(pvalue<0.05)%>%
    filter(percentage_of_Pathway_detected>10)

  if(is.null(Res_Select)==FALSE){
     MetaProViz::viz_volcano(plot_types="PEA",
                       data= dma_res, #Must be the data you have used as an input for the pathway analysis
                       data2=as.data.frame(Res_Select )%>%dplyr::rename("term"="ID"),
                       metadata_info= c(PEA_Pathway="term",# Needs to be the same in both, metadata_feature and data2.
                                       PEA_stat="p.adjust",#Column data2
                                       PEA_score="GeneRatio",#Column data2
                                       PEA_Feature="MetaboliteID"),# Column metadata_feature (needs to be the same as row names in data)
                       metadata_feature= KEGG_Pathways,#Must be the pathways used for pathway analysis
                       plot_name= paste("KEGG_", comparison, sep=""),
                       subtitle= "PEA" )
  }
}

Biological regulated clustering

to understand which metabolites are changing independent of the patients age, hence only due to tumour versus normal, and which metabolites change independent of tumour versus normal, hence due to the different age, we can use the MetaProViz::mca_2cond() function.
Metabolite Clustering Analysis (MCA) enables clustering of metabolites into groups based on logical regulatory rules. Here we set two different thresholds, one for the differential metabolite abundance (Log2FC) and one for the significance (e.g. p.adj). This will define if a feature (= metabolite) is assigned into:
1. “UP”, which means a metabolite is significantly up-regulated in the underlying comparison.
2. “DOWN”, which means a metabolite is significantly down-regulated in the underlying comparison.
3. “No Change”, which means a metabolite does not change significantly in the underlying comparison and/or is not defined as up-regulated/down-regulated based on the Log2FC threshold chosen.

Thereby “No Change” is further subdivided into four states:
1. “Not Detected”, which means a metabolite is not detected in the underlying comparison.
2. “Not Significant”, which means a metabolite is not significant in the underlying comparison.
3. “Significant positive”, which means a metabolite is significant in the underlying comparison and the differential metabolite abundance is positive, yet does not meet the threshold set for “UP” (e.g. Log2FC >1 = “UP” and we have a significant Log2FC=0.8).
4. “Significant negative”, which means a metabolite is significant in the underlying comparison and the differential metabolite abundance is negative, yet does not meet the threshold set for “DOWN”.

For more information you can also check out the other vignettes.

MCAres <-  MetaProViz::mca_2cond(data_c1=ResList[["Young"]][["dma"]][["TUMOR_vs_NORMAL"]],
                                 data_c2=ResList[["Old"]][["dma"]][["TUMOR_vs_NORMAL"]],
                                 metadata_info_c1=c(ValueCol="Log2FC",StatCol="p.adj", cutoff_stat= 0.05, ValueCutoff=1),
                                 metadata_info_c2=c(ValueCol="Log2FC",StatCol="p.adj", cutoff_stat= 0.05, ValueCutoff=1),
                                 feature = "Metabolite",
                                 save_table = "csv",
                                 method_background="C1&C2"#Most stringend background setting, only includes metabolites detected in both comparisons
                                 )

Now we can use this information to colour code our volcano plot. We will plot individual vocano plots for each metabolite pathway as defined by the feature metadata provided as part of the data in (Hakimi et al. 2016).

# Add metabolite information such as KEGG ID or pathway to results
MetaData_Metab <- merge(x=Tissue_MetaData,
                   y= MCAres[["MCA_2Cond_Results"]][, c(1, 14:15)]%>%tibble::column_to_rownames("Metabolite"),
                   by=0,
                   all.y=TRUE)%>%
  tibble::column_to_rownames("Row.names")

MetaProViz::viz_volcano(plot_types="Compare",
                       data=ResList[["Young"]][["dma"]][["TUMOR_vs_NORMAL"]]%>%tibble::column_to_rownames("Metabolite"),
                       data2= ResList[["Old"]][["dma"]][["TUMOR_vs_NORMAL"]]%>%tibble::column_to_rownames("Metabolite"),
                       name_comparison= c(data="Young", data2= "Old"),
                       metadata_feature =  MetaData_Metab,
                      plot_name= "Young-TUMOR_vs_NORMAL compared to Old-TUMOR_vs_NORMAL",
                       subtitle= "Results of dma",
                       metadata_info = c(individual = "SUPER_PATHWAY",
                                        color = "RG2_Significant"))

MetaProViz::viz_volcano(plot_types="Compare",
                       data=ResList[["Young"]][["dma"]][["TUMOR_vs_NORMAL"]]%>%tibble::column_to_rownames("Metabolite"),
                       data2= ResList[["Old"]][["dma"]][["TUMOR_vs_NORMAL"]]%>%tibble::column_to_rownames("Metabolite"),
                       name_comparison= c(data="Young", data2= "Old"),
                       metadata_feature =  MetaData_Metab,
                      plot_name= "Young-TUMOR_vs_NORMAL compared to Old-TUMOR_vs_NORMAL_Sub",
                       subtitle= "Results of dma",
                       metadata_info = c(individual = "SUB_PATHWAY",
                                        color = "RG2_Significant"))

Pathway enrichment

Next, we perform Over Representation Analysis (ORA) using KEGG pathways for each comparison.

# Since we have performed multiple comparisons (all_vs_HK2), we will run ORA for each of this comparison
DM_ORA_res<- list()

KEGG_Pathways <- MetaProViz::metsigdb_kegg()

for(comparison in names(ResList)){
  dma_res <- merge(x= Tissue_MetaData_Extended,
                   y=ResList[[comparison]][["dma"]][["TUMOR_vs_NORMAL"]], 
                   by="Metabolite", 
                   all=TRUE)
  
  #Ensure unique IDs and full background --> we include measured features that do not have a KEGG ID.
  dma_res <- dma_res[,c(3,21:25)]%>%
    dplyr::mutate(InputID_select = if_else(
      is.na(InputID_select),
      paste0("NA_", cumsum(is.na(InputID_select))),
      InputID_select
      ))%>% #remove duplications and keep the higher Log2FC measurement
    group_by(InputID_select) %>%
    slice_max(order_by = Log2FC, n = 1, with_ties = FALSE) %>%
    ungroup()%>%
    remove_rownames()%>%
    tibble::column_to_rownames("InputID_select")
  
  
  #Perform ORA
  Res <- MetaProViz::standard_ora(data= dma_res, #Input data requirements: column `t.val` and column `Metabolite`
                                 metadata_info=c(pvalColumn="p.adj", percentageColumn="t.val", PathwayTerm= "term", PathwayFeature= "MetaboliteID"),
                                 input_pathway=KEGG_Pathways,#Pathway file requirements: column `term`, `Metabolite` and `Description`. Above we loaded the Kegg_Pathways using MetaProViz::Load_KEGG()
                                 pathway_name=paste0("KEGG_", comparison, sep=""),
                                 min_gssize=3,
                                 max_gssize=1000,
                                 cutoff_stat=0.01,
                                 cutoff_percentage=10)

  DM_ORA_res[[comparison]] <- Res

  #Select to plot:
  Res_Select <- Res[["ClusterGosummary"]]%>%
    filter(p.adjust<0.1)%>%
    #filter(pvalue<0.05)%>%
    filter(percentage_of_Pathway_detected>10)

  if(is.null(Res_Select)==FALSE){
     MetaProViz::viz_volcano(plot_types="PEA",
                       data= dma_res, #Must be the data you have used as an input for the pathway analysis
                       data2=as.data.frame(Res_Select )%>%dplyr::rename("term"="ID"),
                       metadata_info= c(PEA_Pathway="term",# Needs to be the same in both, metadata_feature and data2.
                                       PEA_stat="p.adjust",#Column data2
                                       PEA_score="GeneRatio",#Column data2
                                       PEA_Feature="MetaboliteID"),# Column metadata_feature (needs to be the same as row names in data)
                       metadata_feature= KEGG_Pathways,#Must be the pathways used for pathway analysis
                      plot_name= paste("KEGG_", comparison, sep=""),
                       subtitle= "PEA" )
  }
}

Session information

#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] stringr_1.5.1    tibble_3.3.0     tidyr_1.3.1      rlang_1.1.6      dplyr_1.1.4      magrittr_2.0.3  
#> [7] MetaProViz_3.0.3
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3              gridExtra_2.3          httr2_1.2.0            tcltk_4.4.3            logger_0.4.0          
#>   [6] readxl_1.4.5           compiler_4.4.3         RSQLite_2.4.1          systemfonts_1.2.3      vctrs_0.6.5           
#>  [11] reshape2_1.4.4         rvest_1.0.4            pkgconfig_2.0.3        crayon_1.5.3           fastmap_1.2.0         
#>  [16] backports_1.5.0        labeling_0.4.3         rmarkdown_2.29         sessioninfo_1.2.3      tzdb_0.5.0            
#>  [21] ggbeeswarm_0.7.2       ragg_1.4.0             ggfortify_0.4.18       purrr_1.1.0            bit_4.6.0             
#>  [26] xfun_0.52              cachem_1.1.0           jsonlite_2.0.0         progress_1.2.3         blob_1.2.4            
#>  [31] later_1.4.2            broom_1.0.8            parallel_4.4.3         prettyunits_1.2.0      R6_2.6.1              
#>  [36] bslib_0.9.0            stringi_1.8.7          RColorBrewer_1.1-3     ComplexUpset_1.3.3     limma_3.65.1          
#>  [41] car_3.1-3              lubridate_1.9.4        jquerylib_0.1.4        cellranger_1.1.0       Rcpp_1.1.0            
#>  [46] knitr_1.50             R.utils_2.13.0         readr_2.1.5            splines_4.4.3          igraph_2.1.4          
#>  [51] timechange_0.3.0       tidyselect_1.2.1       rstudioapi_0.17.1      qvalue_2.38.0          abind_1.4-8           
#>  [56] yaml_2.3.10            ggVennDiagram_1.5.4    curl_6.4.0             plyr_1.8.9             withr_3.0.2           
#>  [61] inflection_1.3.6       evaluate_1.0.4         desc_1.4.3             zip_2.3.3              xml2_1.3.8            
#>  [66] pillar_1.11.0          ggpubr_0.6.1           carData_3.0-5          checkmate_2.3.2        generics_0.1.4        
#>  [71] vroom_1.6.5            hms_1.1.3              ggplot2_3.5.2          scales_1.4.0           gtools_3.9.5          
#>  [76] OmnipathR_3.17.4       glue_1.8.0             pheatmap_1.0.13        scatterplot3d_0.3-44   tools_4.4.3           
#>  [81] ggsignif_0.6.4         fs_1.6.6               XML_3.99-0.18          grid_4.4.3             qcc_2.7               
#>  [86] colorspace_2.1-1       patchwork_1.3.1        beeswarm_0.4.0         vipor_0.4.7            Formula_1.2-5         
#>  [91] cli_3.6.5              rappdirs_0.3.3         kableExtra_1.4.0       textshaping_1.0.1      Polychrome_1.5.4      
#>  [96] viridisLite_0.4.2      svglite_2.2.1          gtable_0.3.6           R.methodsS3_1.8.2      rstatix_0.7.2         
#> [101] hash_2.2.6.3           selectr_0.4-2          EnhancedVolcano_1.24.0 sass_0.4.10            digest_0.6.37         
#> [106] ggrepel_0.9.6          htmlwidgets_1.6.4      farver_2.1.2           memoise_2.0.1          htmltools_0.5.8.1     
#> [111] pkgdown_2.1.3          R.oo_1.27.1            factoextra_1.0.7       lifecycle_1.0.4        httr_1.4.7            
#> [116] statmod_1.5.0          bit64_4.6.0-1          MASS_7.3-64

Bibliography

Hakimi, A Ari, Ed Reznik, Chung-Han Lee, Chad J Creighton, A Rose Brannon, Augustin Luna, B Arman Aksoy, et al. 2016. “An Integrated Metabolic Atlas of Clear Cell Renal Cell Carcinoma.” Cancer Cell 29 (1): 104–16. https://doi.org/10.1016/j.ccell.2015.12.004.

Christina Schmidt