Skip to contents


In this tutorial we showcase how to use MetaProViz prior knowledge:
- 1. To understand detected metabolite IDs in measured data.

- 2. To access metabolite prior knowledge and metabolite-gene prior knowledge networks.

- 3. To link experimental data to prior knowledge - the Do’s and Don’ts.

- 4. To deal with many-to-many mapping in your metabolite identifiers.

- 5. To perform pathway enrichment analysis.


First if you have not done yet, install the required dependencies and load the libraries:

# 1. Install Rtools if you haven’t done this yet, using the appropriate version (e.g.windows or macOS).
# 2. Install the latest development version from GitHub using devtools
# devtools::install_github("https://github.com/saezlab/MetaProViz")

library(MetaProViz)
library(magrittr)
library(rlang)
library(purrr)
library(dplyr)
library(stringr)

# Please install the Biocmanager Dependencies:
# BiocManager::install("clusterProfiler")
# BiocManager::install("EnhancedVolcano")
# BiocManager::install("cosmosR")



Loading the example data


As part of the MetaProViz package you can load the example feature metadata using the function toy_data():
1. Metadata of cell line experiment (CellLine)
Example datasets is publicly available on metabolomics workbench project PR001418 including metabolic profiles of human renal epithelial cells HK2 and cell renal cell carcinoma (ccRCC) cell lines cultured in Plasmax cell culture media (Sciacovelli et al. 2022).

# Load the Pre-processed intracellular data:
FeatureMetadata_Cells <- MetaProViz::ToyData(Data="Cells_MetaData")


2. Metadata of patients tissue experiment (Tissue)
Example datasets is publicly available in the paper “An Integrated Metabolic Atlas of Clear Cell Renal Cell Carcinoma”, which includes metabolomic profiling on 138 matched clear cell renal cell carcinoma (ccRCC)/normal tissue pairs.

# Load the example data:
FeatureMetadata_Tissue <- MetaProViz::ToyData(Data="Tissue_MetaData")


3. Metadata of Biocrates kit (Biocrates)
Here we use the Biocrates kit feature information of the “MxP® Quant 500 XL kit” that covers more than 1,000 metabolites from various biochemical classes.

# Load the example data:
FeatureMetadata_Biocrates <- MetaProViz::ToyData(Data="BiocratesFeatureTable")



1. Metabolite IDs in measured data

Assigning Metabolite IDs to measured data
The difficulty with assigning metabolite IDs to measured data is the uncertainty in the detection of metabolites. Indeed, differentiation of structural isomers (both constitutional isomers and stereoisomers) as for example the distinction between enantiomers. This leads to loss of information and hence uncertainty is assigning metabolite IDs.
One example is the metabolite Alanine, which can occur in its L- or D- form. If in an experiment those enantiomers have not been distinguished, the correct way would be to either assign two metabolite IDs (L- and D-Alanine) or a more general Alanine ID without chiral information. Yet, in reality this is not as trivial:
Available Alanine IDs in HMDB and ChEBI.
TrivialName HMDB ChEBI KEGG PubChem
D-Alanine HMDB0001310 15570 C00133 71080
L-Alanine HMDB0000161 16977 C00041 5950
Alanine NA 16449 C01401 602
Alanine zwitterion NA 66916 NA 57383916

Indeed, dependent on the database, the Alanine metabolite can have different IDs available:

For instance, if we want to assign a HMDB ID, we have to assign both “HMDB0001310”, “HMDB0000161” to the metabolite Alanine, for ChEBI we could assign only one, “16449”, but this may lead to other problems as the ChEBI ID is not specific and may not be part of certain metabolic pathways. The reason for this is that substrate chirality is critical to enzymatic processes and stereo selectivity of enzymes to be homochiral with predominance of one particular enantiomer (e.g. D-sugars, L-amino acids, etc.).
To showcase the severity of this problem, we can look at the occurrence of those metabolites in metabolic pathways across different databases. To do so we searched for those metabolite IDs in the RaMP database (Braisted et al. 2023) and extracted the pathways they are part of:
Alanine IDs in HMDB and ChEBI mapped to pathways from wiki, KEGG and Reactome using RamP.
TrivialName ID Database PathwayCount
L-Alanine 16977 ChEBI 44
L-Alanine 5950 PubChem 44
L-Alanine C00041 KEGG 44
L-Alanine HMDB0000161 HMDB 44
D-Alanine 15570 ChEBI 3
D-Alanine 71080 PubChem 3
D-Alanine C00133 KEGG 3
D-Alanine HMDB0001310 HMDB 3
Alanine zwitterion 57383916 PubChem 0
Alanine zwitterion 66916 ChEBI 0
Alanine 16449 ChEBI 4
Alanine 602 PubChem 4
Alanine C01401 KEGG 4




This showcases if we choose the ChEBI ID for Alanine (ChEBI ID 16449), if experimentally the distinction was not possible, we will not map to any pathway even though the metabolite is part of many pathways. Hence, we recommend to assign multiple IDs to a measured ID, where specificity in detection is not given.

Yet, many metabolomics studies do not report multiple IDs, but rather one ID per measured peak. In some cases the chosen ID is correct in representing the degree of ambiguity in the detection, e.g. “Alanine-Zwitterion”, whilst in other cases a specific ID, which has not been detected in this specificity, e.g. “L-Alanine”. In both cases this can lead to missed maps to prior knowledge and hence a loss of information as discussed above.
Hence, we assign multiple IDs to a measured peak, where specificity in detection may not be given (i.e. amino acids).

# Example Cell-line data:
Input_HMDB <- FeatureMetadata_Cells%>%
  dplyr::filter(!HMDB == "NA")%>% # ID in the measured data we want to use, hence we remove NA's
  dplyr::select("HMDB", "Pathway") # only keep relevant columns

# Add equivalent IDs:
FeatureMetadata_Cells_AddIDs <- MetaProViz::EquivalentIDs(InputData= Input_HMDB,
                                                          SettingsInfo = c(InputID="HMDB"),# ID in the measured data, here we use the HMDB ID
                                                          From = "hmdb")
#> Warning in MetaProViz::EquivalentIDs(InputData = Input_HMDB, SettingsInfo =
#> c(InputID = "HMDB"), : The following IDs are duplicated and removed:
#> HMDB0000725, HMDB0002013, HMDB0000267, HMDB0000755
#> chebi is used to find additional potential IDs for hmdb.
Preview of the DF FeatureMetadata_Cells_AddIDs including new columns with potential additional HMDB IDs assign using MetaProViz::EquivalentIDs() function.
HMDB Pathway PotentialAdditionalIDs
HMDB0000014 Pyrimidine metabolism
HMDB0000033 Not assigned
HMDB0000043 Not assigned HMDB0059606
HMDB0000045 Purine metabolism
HMDB0000050 Purine metabolism HMDB0004401, HMDB0004402, HMDB0004421
HMDB0000052 Alanine, aspartate and glutamate metabolism HMDB0001006
HMDB0000056 Pyrimidine metabolism

2. Accessing Prior Knowledge

Metabolite prior knowledge (PK) is essential for the interpretation of metabolomics data. It can be used to perform pathway enrichment analysis, compound class enrichment analysis, and by using specific PK databases, it can be used to study the connection of metabolites and receptors or transporters.Since the quality and content of the PK will dictate the success of the downstream analysis and biological interpretation, it is important to ensure the PK is used correctly.
Specifically in metabolite PK, the many different PK databases and resources pose several issues. Indeed, the metabolite identifiers (e.g. KEGG, HMDB, PubChem, etc.) are not standardized across databases, and the same metabolite can have multiple identifiers in different databases. This is known as the many-to-many mapping problem. Moreover, metabolic pathways that are the basis of the PK databases also includes co-factors such as ions or other small molecules that are not only part of most reactions, but can also not be detected in experimentally acquired data (e.g. H2O, CO2, etc).
To address these issues and provide a standardized way to access and integrate metabolite PK, we provide access to several prior knowledge resources and removed molecules such as water. We terme this collection of metabolite sets MetSigDB (Metabolite signature database).

KEGG pathway-metabolite sets

KEGG pathways that are loaded via KEGG API using the package KEGGREST and can be used to perform pathway analysis (Kanehisa and Goto 2000). (KEGG_Pathways)

# This will use KEGGREST to query the KEGG API to load the pathways:
MetaProViz::LoadKEGG()

Preview of the DF KEGG_Pathways.
Description MetaboliteID term Metabolite pubchem compound_names
map00010 C00022 Glycolysis / Gluconeogenesis Pyruvate 3324 Pyruvate , Pyruvic acid , 2-Oxopropanoate , 2-Oxopropanoic acid, Pyroracemic acid
map00010 C00024 Glycolysis / Gluconeogenesis Acetyl-CoA 3326 Acetyl-CoA , Acetyl coenzyme A
map00010 C00031 Glycolysis / Gluconeogenesis D-Glucose 3333 D-Glucose , Grape sugar , Dextrose , Glucose , D-Glucopyranose
map00030 C00022 Pentose phosphate pathway Pyruvate 3324 Pyruvate , Pyruvic acid , 2-Oxopropanoate , 2-Oxopropanoic acid, Pyroracemic acid
map00030 C00031 Pentose phosphate pathway D-Glucose 3333 D-Glucose , Grape sugar , Dextrose , Glucose , D-Glucopyranose
map00030 C00085 Pentose phosphate pathway D-Fructose 6-phosphate 3385 D-Fructose 6-phosphate , D-Fructose 6-phosphoric acid, Neuberg ester

Chemical class-metabolite sets

The chemical class-metabolite sets are based on the classification of metabolites into chemical classes, which can be used to perform compound class enrichment analysis.
The chemical class-metabolite sets were curated by RaMP-DB, which used ClassyFire (Braisted et al. 2023). Here we access them via OmnipathR.

MetaProViz::LoadRAMP()
#> Cached file loaded from: ~/.cache/RaMP-ChemicalClass_Metabolite.rds

Preview of the DF ChemicalClass_MetabSet.
class_source_id common_name ClassyFire_class ClassyFire_super_class ClassyFire_sub_class
HMDB0000001 3-Methylhistidine; 1-Methylhistidine Carboxylic acids and derivatives Organic acids and derivatives Amino acids, peptides, and analogues
HMDB0000479 3-Methylhistidine; 1-Methylhistidine Carboxylic acids and derivatives Organic acids and derivatives Amino acids, peptides, and analogues
HMDB0000005 2-Ketobutyric acid Keto acids and derivatives Organic acids and derivatives Short-chain keto acids and derivatives
HMDB0000008 2-Hydroxybutyric acid Hydroxy acids and derivatives Organic acids and derivatives Alpha hydroxy acids and derivatives
HMDB0000010 2-Methoxyestrone Steroids and steroid derivatives Lipids and lipid-like molecules Estrane steroids
HMDB0000163 D-Maltose Organooxygen compounds Organic oxygen compounds Carbohydrates and carbohydrate conjugates
HMDB0000169 D-Mannose Organooxygen compounds Organic oxygen compounds Carbohydrates and carbohydrate conjugates
HMDB0000172 L-Isoleucine Carboxylic acids and derivatives Organic acids and derivatives Amino acids, peptides, and analogues
HMDB0000174 L-Fucose Organooxygen compounds Organic oxygen compounds Carbohydrates and carbohydrate conjugates
HMDB0004940 N-Glycoloylganglioside GM1; Ganglioside GM2 (d18:1/20:0); N-Glycoloylganglioside GM2 Sphingolipids Lipids and lipid-like molecules Glycosphingolipids
HMDB0000772 Nonadecanoic acid Fatty Acyls Lipids and lipid-like molecules Fatty acids and conjugates
HMDB0000780 Porphinehexacarboxylic acid Tetrapyrroles and derivatives Organoheterocyclic compounds Porphyrins
HMDB0000781 N-Acetylgalactosamine 4-sulphate Organooxygen compounds Organic oxygen compounds Carbohydrates and carbohydrate conjugates

Create pathway-metabolite sets

The function Make_GeneMetabSet can be used to convert gene names to metabolite names by using a PK network of metabolic reactions calls CosmosR (Dugourd et al. 2021). This function is useful if you want to perform pathway enrichment analysis on available gene-sets such as the Hallmarks gene-sets from MSigDB (Castanza et al. 2022). Moreover, it enables you to perform combined pathway enrichment analysis on metabolite-gene sets, if you have other data types such as proteomics measuring the enzymes expression.
The Hallmarks (Liberzon et al. 2015) gene-set and Gaude (Gaude and Frezza 2016) gene-set is available in the package MetaProViz and can be loaded using the function LoadHallmarks() and LoadGaude respectively.

# Load the example gene-sets:
MetaProViz::LoadHallmarks()

MetaProViz::LoadGaude()

Preview of the DF Hallmarks_Pathways including gene-sets usable for pathway enrichment analysis.
class_source_id common_name ClassyFire_class ClassyFire_super_class ClassyFire_sub_class
HMDB0001886 3-Methylxanthine Imidazopyrimidines Organoheterocyclic compounds Purines and purine derivatives
HMDB0001888 N,N-Dimethylformamide Carboxylic acids and derivatives Organic acids and derivatives Carboxylic acid derivatives
HMDB0007484 DG(20:3(8Z,11Z,14Z)/20:0/0:0) Glycerolipids Lipids and lipid-like molecules Diradylglycerols
HMDB0007485 DG(20:3(8Z,11Z,14Z)/20:1(11Z)/0:0) Glycerolipids Lipids and lipid-like molecules Diradylglycerols
HMDB0007484 DG(20:3(8Z,11Z,14Z)/20:0/0:0) Glycerolipids Lipids and lipid-like molecules Diradylglycerols
HMDB0007485 DG(20:3(8Z,11Z,14Z)/20:1(11Z)/0:0) Glycerolipids Lipids and lipid-like molecules Diradylglycerols
HMDB0007486 DG(20:3(8Z,11Z,14Z)/20:2(11Z,14Z)/0:0) Glycerolipids Lipids and lipid-like molecules Diradylglycerols

Preview of the DF gaude_Pathways including gene-sets usable for pathway enrichment analysis.
gene term UniqueGene
CPT1A Carnitine shuttle Unique
CPT1B Carnitine shuttle Unique
CPT1C Carnitine shuttle Unique
CPT2 Carnitine shuttle Unique
CRAT Carnitine shuttle Unique
SLC22A4 Carnitine shuttle In multiple Pathways
SLC22A5 Carnitine shuttle In multiple Pathways
SLC25A20 Carnitine shuttle In multiple Pathways
ACO1 Citric Acid Cycle Unique
ACO2 Citric Acid Cycle Unique
GOT1 Citric Acid Cycle In multiple Pathways


Now we can use the function Make_GeneMetabSet to translate the gene names to metabolite names.

# Translate gene names to metabolite names
Hallmarks_GeneMetab <- MetaProViz::Make_GeneMetabSet(Input_GeneSet=Hallmark_Pathways,
                                                     SettingsInfo=c(Target="gene"),
                                                     PKName="Hallmarks")

Gaude_GeneMetab <- MetaProViz::Make_GeneMetabSet(Input_GeneSet=Gaude_Pathways,
                                                     SettingsInfo=c(Target="gene"),
                                                     PKName="Gaude")

Preview of the DF Hallmarks_GeneMetab including gene-sets usable for pathway enrichment analysis.
term feature
HALLMARK_GLYCOLYSIS ME2
HALLMARK_GLYCOLYSIS LDHC
HALLMARK_GLYCOLYSIS FKBP4
HALLMARK_GLYCOLYSIS HMDB0000241
HALLMARK_GLYCOLYSIS HMDB0000570
HALLMARK_GLYCOLYSIS HMDB0000122



Given that we have the gene-metabolite-sets, we can now also run enrichment analysis on combined data types, once including the metabolite Log2FC and one including gene Log2FC from e.g. transcriptomics or proteomics data. Yet, it is important to keep in mind that generally we detect less metabolites than genes and hence this may bias the results obtained from combined enrichment analysis.

MetaLinksDB metabolite-receptor & metabolite-transporter sets

The MetaLinks database is a manually curated database of metabolite-receptor and metabolite-transporter sets that can be used to study the connection of metabolites and receptors or transporters (Farr et al. 2024).

MetaLinksDB_Res <- MetaProViz::LoadMetalinks()

Preview of the DF MetaLinksDB including metabolite-receptor sets.
hmdb metabolite pubchem metabolite_subclass uniprot gene_symbol protein_type source db_score experiment_score combined_score mor type transport_direction mode_of_regulation
HMDB0002111 Water 962 NA A0A087X1C5 CYP2D7 NA rhea NA NA NA 1 Production-Degradation NA Activating
HMDB0001377 Oxygen 977 Other non-metal oxides A0A087X1C5 CYP2D7 NA rhea NA NA NA -1 Production-Degradation NA Inhibiting
HMDB0001377 Oxygen 977 Other non-metal oxides A0A1W2PPD8 KDM4F NA rhea NA NA NA -1 Production-Degradation NA Inhibiting
HMDB0001967 Carbon dioxide 280 Other non-metal oxides A0A1W2PPD8 KDM4F NA rhea NA NA NA 1 Production-Degradation NA Activating
HMDB0002111 Water 962 NA A0A1W2PQ27 SSU72L1 NA rhea NA NA NA -1 Production-Degradation NA Inhibiting
HMDB0002111 Water 962 NA A0A1W2PQ64 SSU72L5 NA rhea NA NA NA -1 Production-Degradation NA Inhibiting
HMDB0002111 Water 962 NA A0A1W2PQC6 SSU72L4 NA rhea NA NA NA -1 Production-Degradation NA Inhibiting
HMDB0002111 Water 962 NA A0A1W2PQD8 SSU72L2 NA rhea NA NA NA -1 Production-Degradation NA Inhibiting
Preview of the DF MetaLinksDB_Type including metabolite-receptor and metabolite-transporter sets.
hmdb metabolite pubchem uniprot gene_symbol term_specific mode_of_regulation term
HMDB0000660 D-Fructose 439709 A0PJK1 SLC5A10 transporter_Production-Degradation Activating transporter
HMDB0002212 Arachidic acid 10467 A6NFH5 FABP12 other_protein_Production-Degradation Activating other_protein
HMDB0037790 Polyethylene glycol 174 O00206 TLR4 catalytic_receptor_Ligand-Receptor Binding catalytic_receptor
HMDB0034442 L-trans-alpha-Amino-2-carboxycyclopropaneacetic acid 1271 O00222 GRM8 gpcr_Ligand-Receptor Binding gpcr

Comparison of PK gene and metabolite coverage

We have now loaded a number of different PK sources (details above) and aim to compare the overlap in coverage between these PK sources to understand if there are certain genes or metabolites covered by one PK resource but not the others. Here we will focus on comparing the resources containing gene-metabolite sets (Hallmarks, Gaude, MetalinksDB).
MetaProViz:::GenerateUpset() helper function, which will generate an Upset plot showing the overlap of coverage (similar to how a Venn diagram works, but better able to visualize multiple combinations clearly).

# Ensure no NA values and extract unique features
hallmarks_features <- unique(na.omit(Hallmarks_GeneMetab$GeneMetabSet$feature))
gaude_features <- unique(na.omit(Gaude_GeneMetab$GeneMetabSet$feature))
metalinks_features <- unique(c(
  na.omit(MetaLinksDB_Res$MetalinksDB$hmdb),
  na.omit(MetaLinksDB_Res$MetalinksDB$gene_symbol)
))

# Get all unique features across the three sets
all_features <- unique(c(hallmarks_features, gaude_features, metalinks_features))

# Ensure all_features is a vector
all_features <- as.character(all_features)

df_binary <- data.frame(
  Feature = all_features,
  Hallmarks = as.integer(all_features %in% hallmarks_features),
  Gaude = as.integer(all_features %in% gaude_features),
  Metalinks = as.integer(all_features %in% metalinks_features),
  stringsAsFactors = FALSE
)

# Here we categorise the featurs as either metabolites (if the prefix is HMDB) or genes
df_binary$Type <- ifelse(grepl("^HMDB", df_binary$Feature), "metabolite (HMDB)", "gene")

# Add a 'None' column for features not present in any set
df_upset_pk_comparison <- df_binary %>%
  mutate(None = as.integer(rowSums(select(., -Feature, -Type)) == 0))

Preview of the DF df_upset_pk_comparison showing coverage of features (genes or metabolites) across PK sources.
Feature Hallmarks Gaude Metalinks Type None
FABP4 1 0 1 gene 0
ADIPOQ 1 0 0 gene 0
PPARG 1 0 1 gene 0
LIPE 1 0 1 gene 0
DGAT1 1 1 1 gene 0
LPL 1 1 1 gene 0
CPT2 1 1 1 gene 0
CD36 1 0 1 gene 0


MetaProViz:::GenerateUpset(df = df_upset_pk_comparison,
                          class_col = "Type",
                          intersect_cols = c("Hallmarks", "Gaude", "Metalinks"),
                          plot_title = "Overlap of Prior Knowledge Resources",
                          palette_type = "polychrome",
                          output_file = NULL)


The UpsetPlot shows that MetalinksDB has unique genes and metabolites not present in the other resources. This is to be expected, since MetalinksDB focuses on receptors, transporters and metabolic enzymes, whilst Gaude and Hallmakrs focus on pathways.
Since Gaude only focuses on metabolic enzymes of pathways, it contains a low number of genes and hence has a small amount of unique genes and metabolites. In this regard it makes sense that Hallmarks includes many unique genes, since Hallmarks also includes other genes and pathways not related to metabolism. Between Hallmarks and Gaude we observe a high overlap with metabolites, which can be explained since both gene sets are assigned metabolites by using the same metabolic pathways in the backend of MetaProViz::Make_GeneMetabSet and it is important to keep in mind that this dictates the gene-metabolite associations added.

3. Linking experimental data to prior knowledge

Now that we have loaded the prior knowlegde and inspected the overlap of genes and metabolites between the different PK resources, in this section we want to link our experimental data to the PK.
The experimental data in this example will be the data loaded in section 1 (Loading the example data). Whichever resource the experimental data resembles, we recommend to follow this overall approach:

1. Determine the metabolite identifiers from the experimental data and inspect their coverage.
2. Select the metabolite identifiers from the experimental data, which can be used to link to the PK of interest.
3. Connect metabolite identifiers to the PK and assess overlap.

Here one can select any combination of experimental data and PK. This selection should be based on the experimental data at hand and the biological research question.
For this vignette, we will focus on the combination of biocrates experimental data and MetalinksDB PK.

Biocrates data and MetalinksDB

1. Determine identifiers and inspect coverage

First we determine the identifiers and inspect the coverage.
In many experimentally aquired data, there can be different metabolite IDs used to describe the metabolite. In the Biocrates data the available metabolite identifiers are mostly CHEBI, HMDB, and LIMID that could be linked to PK resources. Now we select these metabolite identifier columns to count the coverage and look at the combinations of the coverage grouped by the class of metabolite. This helps us to understand what to expect, when linking the data to prior knowledge.

df_binary_biocft <- data.frame(
  trivname = FeatureMetadata_Biocrates$TrivialName,
  CHEBI = as.integer(!is.na(FeatureMetadata_Biocrates$CHEBI)),
  HMDB  = as.integer(!is.na(FeatureMetadata_Biocrates$HMDB)),
  LIMID = as.integer(!is.na(FeatureMetadata_Biocrates$LIMID))
)

df_upset_biocft <- df_binary_biocft %>%
  mutate(
    None = as.integer(rowSums(across(c(CHEBI, HMDB, LIMID))) == 0),
    Class = FeatureMetadata_Biocrates$Class  # or however you access the Class column
  )

Preview of the DF df_upset_biocft showing coverage of identifiers.
trivname CHEBI HMDB LIMID None Class
1-Met-His 1 1 0 0 Aminoacids Related
3-IAA 1 1 0 0 Indoles Derivatives
3-IPA 1 1 0 0 Indoles Derivatives
3-Met-His 1 1 0 0 Aminoacids Related
5-AVA 1 1 1 0 Aminoacids Related
AABA 1 1 1 0 Aminoacids Related
AbsAcid 1 1 1 0 Hormones
Ac-Orn 1 1 0 0 Aminoacids Related


MetaProViz:::GenerateUpset(df = df_upset_biocft,
                          class_col = "Class",
                          intersect_cols = c("LIMID", "HMDB", "CHEBI", "None"),
                          plot_title = "Overlap of Prior Knowledge Resources",
                          palette_type = "polychrome",
                          output_file = NULL)


These results tell us that:

  • Less than half the metabolites had CHEBI+HMDB+LIMID (n=440), which isn’t great but probably higher than we have expected.

  • If you relied only upon -

    • CHEBI: then you would miss at least 229 metabolites with HMDB+LIMID

    • HMDB: then you would miss at least 126 metabolites with CHEBI+LIMID

    • LIMID: then you would miss at least 89 metabolites with CHEBI+HMDB

However there are two things to note about these observations above.

The first is that these numbers are minimums, the real values are a little higher once you add for instance metabolites with just CHEBI or just LIMID, etc. So actually, relying upon only LIMID means we miss 151 biocrates metabolites (89 with only HMDB/ChEBI + 22 with only CHEBI + 8 with only HMDB and 32 with no LIMID/CHEBI/HMDB identifiers).

The second point is more nuanced but just as important: we need to keep in mind that there is not a direct 1:1 relationship between the number of Biocrates metabolites with, for instance, a ChEBI ID, and the number of unique ChEBI IDs that they map to. This upset plot is treating each Biocrates metabolite as an individual entry and categorising whether it has a ChEBI, HMDB, LIMID ID, etc. However we are not considering what those ChEBI, HMDB, LIMIDs actually are. So it is possible that there are cases where the number of metabolites is higher, or lower, than the number of unique IDs due to multimapping.

For example, let’s say we had 3 Biocrates metabolites with only HMDB IDs. These would count as 3 entries in the upset plot. But two of these Biocrates metabolites could both map to the same HMDB ID, whereas the other Biocrates metabolite could map to 6 different HMDBs. In this case, there are 3 Biocrates metabolites with HMDBs but 7 unique HMDBs. So this upset plot should be interpreted with this in mind.

We have inspected the experimental coverage of the metabolite IDs from the Biocrates kit. Next, we have to choose which metabolite identifier to use to link to the PK of choice. Often this choice is dictated by the prior knowledge resource as most of them use a specific identifier.
Here we will use MetalinksDB, which uses HMDB IDs as metabolite identifiers, hence it is best to use the HMDB IDs to link the Biocrates features to the resource. Noteworthy, in some cases the Biocrates metabolites have multiple HMDB IDs listed per metabolite:

# Count entries and record NA information
result_bioc_hmdb_count <- MetaProViz:::CountEntries(FeatureMetadata_Biocrates, "HMDB")
# Access the processed data:
processed_df_bioc_hmdb_count <- result_bioc_hmdb_count$result
# Display the plot:
print(result_bioc_hmdb_count$plot)

From this plot we can make a few observations. First of all, as already pointed out in the previous section, the number of biocrates cells without any HMDB at all is very high (n=253). Secondly, we can see that while a number of biocrates cells have only a single HMDB ID associated with them (n=314), the majority of biocrates metabolites actually have multiple entries (n=452). For those with multiple, this ranges from 2-20 HMDB entries per biocrates cell!

With so many biocrates cells linked to multiple HMDB IDs, how should we deal w0.ith this? Should we only select a single HMDB ID in cases where we have multiple? We discuss the pros and cons of this later in section: Bonus: Are multiple metabolite IDs helpful or a hindrance?. But for now, we will proceed with making the connections to the PK using all HMDBs, which the MetaProViz function CheckMatchID() has been designed to handle.

3. Make connection to PK and assess overlap

Here we connect the experimental Biocrates table to the MetalinksDB PK using the HMDB IDs using the MetaProViz::CheckMatchID() function.

Biocrates_to_MetalinksDB <- MetaProViz:::CheckMatchID(InputData = FeatureMetadata_Biocrates,
                     PriorKnowledge = MetaLinksDB_Res$MetalinksDB,
                     SettingsInfo = c(InputID="HMDB", PriorID="hmdb", GroupingVariable=NULL))
#> Warning in MetaProViz:::CheckMatchID(InputData = FeatureMetadata_Biocrates, :
#> 253 NA values were removed from columnHMDB
#> Warning in MetaProViz:::CheckMatchID(InputData = FeatureMetadata_Biocrates, : 4
#> duplicated IDs were removed from columnHMDB
#> No hmdb provided. If this was not intentional, please check your input.
#> Warning in MetaProViz:::CheckMatchID(InputData = FeatureMetadata_Biocrates, :
#> 40766 duplicated IDs were removed from columnhmdb


This returns some warning messages and 4 tables:

  • InputData_Matched: a summary table with links to the prior knowledge. This is however with any NA values or duplicates of the InputID (e.g. ‘HMDB’) removed from the table.

  • InputData_Matched_NA_and_duplicates: the same kind of summary table but with all NA and duplicates included. Helpful for plotting.

  • InputData_Matched_only_NA: the same kind of summary table but with only rows where the InputID was NA. Helpful for inspecting which rows were NA.

  • InputData_Matched_only_duplicates: the same kind of summary table but with only rows where the InputID was duplicated (or triplicated, etc). Helpful for inspecting which IDs on the input side were duplicated.

In some cases we will expect to have only input ID linked to one entry in the PK. But this won’t always be the case, and as already discussed in the last section above, in many cases here we will have multiple HMDB IDs per single biocrates metabolite cell. The MetaProViz::CheckMatchID() has been designed with this in mind, by default splitting any comma separated values in the InputID (or PriorID) into separate entities, counting the number of links between these, and reporting this to the user.

Let’s take a look at the InputData_Matched data:

Preview of the DF InputData_Matched showing coverage of identifiers (some columns hidden)
HMDB found_match_in_PK matches match_overlap_percentage original_count matches_count GroupingVariable OriginalEntry_InputData num_PK_entries num_PK_entries_groups num_Input_entries num_Input_entries_groups ActionRequired TrivialName Class CAS CHEBI LIMID
HMDB0000001 FALSE 0 1 0 NA HMDB0000001 0 0 1 1 NA 1-Met-His Aminoacids Related 332-80-9 50599 NA
HMDB0000033 FALSE 0 1 0 NA HMDB0000033 0 0 1 1 NA Carnosine Aminoacids Related 305-84-0 15727 NA
HMDB0000036 TRUE HMDB0000036 100 1 1 None HMDB0000036 1 1 1 1 None TCA Bile Acids 81-24-3 28865 LMST05040001
HMDB0000043 TRUE HMDB0000043 100 1 1 None HMDB0000043 1 1 1 1 None Betaine Aminoacids Related 107-43-7 17750 NA
HMDB0000056 TRUE HMDB0000056 100 1 1 None HMDB0000056 1 1 1 1 None beta-Ala Biogenic Amines 107-95-9 16958 NA
HMDB0000062 TRUE HMDB0000062 100 1 1 None HMDB0000062 1 1 1 1 None C0 Acylcarnitines 461-06-3, 541-14-0, 541-15-1 11060, 16347, 17126 NA
HMDB0000063 TRUE HMDB0000063 100 1 1 None HMDB0000063 1 1 1 1 None Cortisol Hormones 50-23-7 17650 LMST02030001
HMDB0000072, HMDB0000958 FALSE 0 2 0 NA NA NA NA NA NA NA AconAcid Carboxylic Acids 499-12-7, 585-84-2, 4023-65-8 22211, 32805, 32806 NA

Now let’s take a look here at the duplicated IDs:

Preview of the DF InputData_Matched_only_duplicates showing coverage of identifiers (some columns hidden)
HMDB found_match_in_PK matches match_overlap_percentage original_count matches_count GroupingVariable OriginalEntry_InputData num_PK_entries num_PK_entries_groups num_Input_entries num_Input_entries_groups ActionRequired TrivialName Class CAS CHEBI LIMID
HMDB0005374, HMDB0044753 FALSE 0 2 0 NA NA NA NA NA NA NA TG 18:0_32:0 Triacylglycerols 2177-97-1, 2177-98-2, 57396-99-3 NA LMGL03010036, LMGL03012849, LMGL03013449, LMGL03013912, LMGL03015081, LMGL03015409
HMDB0013408, HMDB0013419 FALSE 0 2 0 NA NA NA NA NA NA NA PC O-38:0 Phosphatidylcholines NA 67071, 74482, 86243 LMGP01020060, LMGP01020097
HMDB0045896, HMDB0046078 FALSE 0 2 0 NA NA NA NA NA NA NA TG 18:3_34:1 Triacylglycerols 78144-20-4 NA LMGL03010127, LMGL03013513, LMGL03013530, LMGL03014811, LMGL03014828, LMGL03015188, LMGL03015205, LMGL03015538, LMGL03015555, LMGL03015893
HMDB0047778 FALSE 0 1 0 NA HMDB0047778 0 0 1 1 NA TG 16:0_32:3 Triacylglycerols NA NA LMGL03013336, LMGL03014633, LMGL03015359, LMST01031127
HMDB0013408, HMDB0013419 FALSE 0 2 0 NA NA NA NA NA NA NA PC O-38:1 Phosphatidylcholines NA 67070, 74482, 86243, 86438 LMGP01020060, LMGP01020097, LMGP01020099
HMDB0005374, HMDB0044753 FALSE 0 2 0 NA NA NA NA NA NA NA TG 18:0_32:1 Triacylglycerols NA NA LMGL03010036, LMGL03012849
HMDB0047778 FALSE 0 1 0 NA HMDB0047778 0 0 1 1 NA TG 18:2_30:1 Triacylglycerols NA NA LMGL03013864, LMGL03014985
HMDB0045896, HMDB0046078 FALSE 0 2 0 NA NA NA NA NA NA NA TG 20:0_32:4 Triacylglycerols NA NA LMGL03013584, LMGL03014811, LMGL03014828

Now it would be nice to visualise the linkages to get an idea of how well the experimental data (Biocrates in this case) matches to the PK (MetalinksDB in this case). Note again that we also want to take into account the multiple IDs for each metabolite when assessing coverage. In the plot below we do this by counting a link to the PK for any of the HMDB IDs for an individual metabolite only once.

# Define your color scheme and labels
fill_vals <- c("FALSE" = "red", "TRUE" = "#009E73")
fill_lbls <- c("No Match", "Found Match")

# Call the function with desired parameters
p <- MetaProViz:::GenerateStackedBar(
  data = Biocrates_to_MetalinksDB$InputData_Matched_NA_and_duplicates,
  group_col = "Class",
  fill_col = "found_match_in_PK",
  fill_values = fill_vals,
  fill_labels = fill_lbls,
  plot_title = "Mapping status between Biocrates and MetalinksDB \nusing HMDB, grouped by metabolite class",
  x_label = "Frequency",
  y_label = "Class",
  legend_position = c(0.95, 0.05)
)

p

From the above plot we see that a few of the metabolite classes from the biocrates kit have every metabolite with a HMDB ID and a corresponding entry in MetalinksDB. These are represented by bars that are completely green, such as the Amino Acids and Fatty Acids classes.

However for a number of other classes of metabolites, we see that they will not be represented in the MetalinksDB PK at all, either because we do not have a HMDB ID associated with the Biocrates metabolite (represented by grey bars), or because MetalinksDB does not include a HMDB ID for that metabolite (represented by red bars). This poor coverage of course could be a concern for our analysis if we are interested in analysing many of these classes, since it will mean in this case that our experimental results for Phosphatidylglycerols for instance will not be linked to PK. In any case we will certainly need to keep this in mind for downstream analysis and the interpretation of results, so that we don’t overinterpret results that for instance have a large number of Amino Acids, or falsely assume that the absence of Phosphatidylinositols in our PK integration results means that they are not present or important in our data.

Bonus: Are multiple metabolite IDs helpful or a hindrance?

Let’s turn back to consider the number of HMDB IDs we have in the Biocrates data and ask ourself a question: is it helpful or detrimental to have multiple IDs?

To answer this, we will take only the first HMDB of each cell with multiple HMDB IDs and test to see this has worked.

extract_first_id <- function(id_col) {
  sapply(as.character(id_col), function(x) {
    # Check for NA or empty string
    if (is.na(x) || x == "") {
      return(NA)
    }
    # Split on comma (adjust the delimiter if needed)
    parts <- unlist(strsplit(x, split = ","))
    # Return the first value after trimming any whitespace
    return(trimws(parts[1]))
  })
}

# Create a copy of the df
FeatureMetadata_Biocrates_singleHMDB <- FeatureMetadata_Biocrates
# Get the first entry of each HMDB ID
FeatureMetadata_Biocrates_singleHMDB$HMDB_single <- extract_first_id(FeatureMetadata_Biocrates$HMDB)

# Visually check that the single ID function has worked
# Count entries and record NA information
result_bioc_hmdb_count_single <- MetaProViz:::CountEntries(FeatureMetadata_Biocrates_singleHMDB, "HMDB_single")
# Access the processed data:
processed_df_bioc_hmdb_count_single <- result_bioc_hmdb_count_single$result
# Display the plot:
print(result_bioc_hmdb_count_single$plot)

Now that we can see we only have either NA or singular values for our HMDB IDs, let’s map this table to MetalinksDB using the same function as we did earlier, but this time using the HMDB_single column.

Biocrates_to_MetalinksDB_singleHMDB <- MetaProViz:::CheckMatchID(InputData = FeatureMetadata_Biocrates_singleHMDB,
                     PriorKnowledge = MetaLinksDB_Res$MetalinksDB,
                     SettingsInfo = c(InputID="HMDB_single", PriorID="hmdb", GroupingVariable=NULL))
#> Warning in MetaProViz:::CheckMatchID(InputData =
#> FeatureMetadata_Biocrates_singleHMDB, : 253 NA values were removed from
#> columnHMDB_single
#> Warning in MetaProViz:::CheckMatchID(InputData =
#> FeatureMetadata_Biocrates_singleHMDB, : 49 duplicated IDs were removed from
#> columnHMDB_single
#> No hmdb provided. If this was not intentional, please check your input.
#> Warning in MetaProViz:::CheckMatchID(InputData =
#> FeatureMetadata_Biocrates_singleHMDB, : 40766 duplicated IDs were removed from
#> columnhmdb

This doesn’t appear to look that different to before, but if we look closely we can see that some classes have shifted a bit, such as Phosphatidylethanolamines for instance. To better visualise these differences we will zoom in on these changes by filtering only to the successful matches between the Biocrates kit and MetalinksDB, comparing which metabolites were found using only the single hits versus using both.

#> [1] "Counts of linkages made using only a Single HMDB"
#> 
#> FALSE  TRUE 
#>    29   147
#> 
#> Using Multi HMDBs increased coverage over Single HMDBs by: 19.73 %

This has shown that in this case, although the results were overshadowed by the poor overall linkage of the experimental data to the PK, using Multi HMDBs has resulted in nearly 20% (n=29) more metabolites from Biocrates being able to be linked to MetalinksDB than what would have been possible if we only used the first HMDB available to us. Hence while having multiple IDs for a single metabolite may add to confusion as a user, we would recommend against prematurely dropping any IDs until you map to the PK or have thoroughly assessed what impact the removal may have.

4. Translate IDs


Important Information: Translating IDs between databases e.g. KEGG to HMDB is a non-trivial task, and it is expected for one original ID to link to many translated IDs, and vice versa. We discuss the implications throughout this vignette and leave it to user discretion to select the most appropriate ID based on their research question and data.


Across the different prior knowledge resources (see also tables above) specific metabolite IDs are used and hence depending on the prior knowledge resource a specific metabolite ID is required.
If we want to convert or ‘translate’ those IDs to another commonly used form of ID, for instance because our measured data uses another type of ID, we can make use of the MetaProViz::TranslateID() function. This is based on OmniPathR and RaMP-DB (Braisted et al. 2023) on the backend of our function and currently supports ID translation of metabolites to and from the following formats:
- KEGG
- HMDB
- ChEBI
- PubChem

As an example we are translating the KEGG pathways we loaded with MetaProViz::LoadKEGG into HMDB ids:

KEGG_Pathways_Translated <- MetaProViz::TranslateID(InputData= KEGG_Pathways,
                                                     SettingsInfo = c(InputID="MetaboliteID", GroupingVariable="term"),
                                                     From = c("kegg"),
                                                     To = c("hmdb", "pubchem"))

Translation of KEGG IDs in KEGG pathways to HMDB & pubchem IDs
Description MetaboliteID term Metabolite pubchem compound_names hmdb
map00010 C00022 Glycolysis / Gluconeogenesis Pyruvate 1060, 107735 Pyruvate , Pyruvic acid , 2-Oxopropanoate , 2-Oxopropanoic acid, Pyroracemic acid HMDB0000243
map00010 C00024 Glycolysis / Gluconeogenesis Acetyl-CoA 444493, 181 Acetyl-CoA , Acetyl coenzyme A HMDB0001206, HMDB0247926
map00010 C00031 Glycolysis / Gluconeogenesis D-Glucose 64689, 5793, 107526 D-Glucose , Grape sugar , Dextrose , Glucose , D-Glucopyranose HMDB0000122, HMDB0304632, HMDB0000516, HMDB0003340, HMDB0006564, HMDB0062170
map00053 C15926 Ascorbate and aldarate metabolism beta-L-Galactose 1-phosphate beta-L-Galactose 1-phosphate, L-Galactose 1-phosphate
map00053 C15930 Ascorbate and aldarate metabolism L-Galactonate 604 L-Galactonate , L-Galactonic acid HMDB0253886
map00120 C13550 Primary bile acid biosynthesis Cerebrosterol 121948 Cerebrosterol , 24-Hydroxycholesterol , Cholest-5-ene-3,24-diol , (24S)-24-Hydroxycholesterol , (24S)-Cholest-5-ene-3beta,24-diol HMDB0001419
map00120 C15518 Primary bile acid biosynthesis (24S)-Cholest-5-ene-3beta,7alpha,24-triol 11954196 (24S)-Cholest-5-ene-3beta,7alpha,24-triol, 5-Cholesten-3beta,7alpha,24(S)-triol HMDB0060136
map00120 C15519 Primary bile acid biosynthesis 25-Hydroxycholesterol 53477807, 65094, 3247051, 42608393 25-Hydroxycholesterol , Cholest-5-ene-3beta,25-diol HMDB0006247


Here we can immediately see that despite the ID translation, some of the translations between the KEGG MetaboliteID and the HMDB or PubChem IDs have failed, resulting in NA values. To get a better understanding of the combinations of these, let’s visualise the translation for each of the ID types.

We can also note from the previous table that it becomes apparent that the translation of IDs is not a one-to-one mapping, but rather a one-to-many mapping. In fact it is very common that an ID from one format will have a genuine one-to-many relationship with the other format (e.g. one KEGG ID maps to multiple HMDB IDs) or even a many-to-many relationship, where some of the IDs from the new format link back to multiple IDs in the original format (e.g. two different KEGG IDs map to multiple HMDS IDs, some of which are shared between them).
This comes with many implications for the analysis that will be discussed in the next section.

4.1 Mapping problems

The complexities of translating metabolite IDs are demonstrated here (Fig.2). The relationships between Original IDs (e.g. KEGG) and Translated IDs (e.g. HMDB), can be quite complex and in fact we encounter one-to-one (no matches were found for the ID), one-to-none ( direct #relationship was established), one-to-many (multiple matches were found for the ID. i.e. it is ambiguously mapped) and many-to-many ( considers the relationships between the Translated IDs to Original IDs , where a translated ID ambiguously maps back to multiple different Original IDs) mappings.
For enrichment analysis the translation from KEGG IDs to HMDB IDs increases the pathways size, i.e. how many metabolites are in the pathway “Glycolysis / Gluconeogenesis - Homo sapiens (human)”, which would in turn inflate/deflate the enrichment results and hence it would be desired to keep the number of metabolites in a pathway consistent.

Fig. 2: Mapping problems in prior knowledge metabolite-sets when translating metabolite IDs.
Fig. 2: Mapping problems in prior knowledge metabolite-sets when translating metabolite IDs.


Because of this complexity the output of MetaProViz::TranslateID() includes not only the Translation table showcased above, but additionally information about the mapping ambiguity as well as a summary of the relationships between the Original and Translated IDs.
Indeed, the translation of e.g. KEGG to hmdb and pubchem includes multiple data frames including a summary of the mapping occurrences:

names(KEGG_Pathways_Translated)
#> [1] "TranslatedDF"             "TranslatedDF_MappingInfo"
#> [3] "MappingSummary_hmdb"      "MappingSummary_pubchem"

MappingSummary_hmdb
term many_to_many one_to_many one_to_none one_to_one total to_many to_none to_one to_total from_many from_one from_total many_to_one scope
Glycolysis / Gluconeogenesis 12 23 6 13 54 12 6 13 31 6 37 43 0 group
Citrate cycle (TCA cycle) 0 17 3 9 29 8 3 9 20 0 27 27 0 group
Pentose phosphate pathway 20 17 10 16 63 11 10 16 37 10 34 44 0 group
Ascorbate and aldarate metabolism 10 39 26 15 90 16 26 15 57 5 55 60 0 group
Fatty acid biosynthesis 0 14 46 6 66 6 46 6 58 0 21 21 0 group
Fatty acid elongation 8 48 11 3 70 25 11 3 39 4 52 56 0 group



We also have the ability to extract a long version of the DF that includes a row for each mapping occurrence, which can be useful for downstream analysis. Yet, this can become very large dependent on the amount of many-to-many mappings, hence by default we do not generate this summary. Within the MetaProViz::TranslateID() you can set the parameter Summary =TRUE or in case you have a dataframe that includes both, original and translated ID, you can use the function MetaProViz::MappingAmbiguity() to generate this long summary as well as the mapping summary in general.

# Option 1:
KEGG_Pathways_TranslatedSum <- MetaProViz::TranslateID(InputData= KEGG_Pathways,
                                                    SettingsInfo = c(InputID="MetaboliteID", GroupingVariable="term"),
                                                    From = c("kegg"),
                                                    To = c("hmdb", "pubchem"),
                                                    Summary =TRUE)


# Option 2:
MappingProblems <- MetaProViz::MappingAmbiguity(
    InputData =
        KEGG_Pathways_Translated[["TranslatedDF"]] %>%
        dplyr::rename("KEGG"="MetaboliteID") %>%
        dplyr::select(Description, KEGG, term, Metabolite, hmdb),
    From = "KEGG",
    To = "hmdb",
    GroupingVariable = "term",
    Summary=TRUE
)

Long summary of mapping problems taking into account both directions, From-to-To and To-to-From.
KEGG hmdb term KEGG_to_hmdb Count(KEGG_to_hmdb) hmdb_to_KEGG Count(hmdb_to_KEGG) Mapping
C00002 HMDB0000538 Bacterial secretion system C00002 –> HMDB0000538, HMDB0257997 2 HMDB0000538 –> C00002 1 one-to-many
C00002 HMDB0000538 Biosynthesis of cofactors C00002 –> HMDB0000538, HMDB0257997 2 HMDB0000538 –> C00002 1 one-to-many
C00024 HMDB0247926 Glutathione metabolism C00024 –> HMDB0001206, HMDB0247926 2 HMDB0247926 –> C00024 1 one-to-many
C00024 HMDB0247926 Glyoxylate and dicarboxylate metabolism C00024 –> HMDB0001206, HMDB0247926 2 HMDB0247926 –> C00024 1 one-to-many


In the table it is shown that the KEGG ID C00002 maps to 3 different HMDB IDs, and it is shown that one of those HMDB IDs HMDB0000538 maps to one KEGG ID, hence this Mapping is one-to-many. The other two HMDB Ids are also in the table and it is checked to how many KEGG IDs they map. Additionally, we have passed GroupingVariable = "term" as we have pathways, which also means each of those mappings is checked within a pathway and across pathways (e.g. C00002 is shown twice, for two different terms).

In fact, to perform enrichment analysis we need a column source (=e.g. term) and we would want to keep the metabolite IDs across pathways consistent, avoid ambiguous mapping as much as possible (many-to-many mapping), and have this metabolite ID selection guided by any IDs we might have available to us in our measured input data (Fig. 3). This is crucial to circumvent inflation or deflation of metabolite-sets, which in turn will affect the enrichment analysis results. Hence, we can use measured metabolite IDs to guide the selection of PK IDs. This is crucial to circumvent inflation or deflation of metabolite-sets, which in turn will affect the enrichment analysis results.
This is something we are currently working on and hope to provide within the next release, so stay tuned.

Fig. 3: Mapping problems in prior knowledge metabolite-sets when translating metabolite IDs and the connection to detected (measured input metabolites).
Fig. 3: Mapping problems in prior knowledge metabolite-sets when translating metabolite IDs and the connection to detected (measured input metabolites).


5. Run enrichment analysis


There are two options:
1. Over Representation Analysis (ORA) that determines if a set of features (=metabolic pathways) are over-represented in the selection of features (=metabolites) from the data in comparison to all measured features (metabolites) using the Fishers exact test MetaProViz::ClusterORA. This can be applied to clusters of metabolites as for example the results from MetaProViz::MCA_2Cond() or MetaProViz::CoRe() function. If you want to have more details on these clustering methods please visit the vignette Standard Metabolomics or CoRe Metabolomics.


2. Enrichment analysis on standard differential analysis results. We offer ORA MetaProViz::StandardORA, but there are many other statistical tests that can be used for enrichment analysis. The full scope of different methods is beyond the scope of MetaProViz, but are available in decoupleR (Badia-I-Mompel et al. 2022) packages from our group.



As part of the MetaProViz package you can load the example data using the function toy_data() to test the enrichment analysis:
1.Cell line experiment (CellLine)
Here we choose an example datasets, which is publicly available on metabolomics workbench project PR001418 including metabolic profiles of human renal epithelial cells HK2 and cell renal cell carcinoma (ccRCC) cell lines cultured in Plasmax cell culture media (Sciacovelli et al. 2022). The raw data are available via metabolomics workbench study ST002224 were intracellular metabolomics of HK2 and ccRCC cell lines 786-O, 786-M1A and 786-M2A were performed.
We have performed pre-processing and differential analysis (details can be found in the vignette vignette Standard Metabolomics) and and here we load the differential metabolite analysis results for the comparison of 786M-1A versus HK2.

# Load the Pre-processed intracellular data:
Intra_DMA_786M1A_vs_HK2<- MetaProViz::ToyData(Data="IntraCells_DMA")

2. Tissue experiment (Tissue)
Here we chose publicly available data from the paper “An Integrated Metabolic Atlas of Clear Cell Renal Cell Carcinoma”, which includes metabolomic profiling on 138 matched clear cell renal cell carcinoma (ccRCC)/normal tissue pairs.
We have performed differential analysis (details can be found in the vignette Metadata Analysis) and and here we load the differential metabolite analysis results for the comparison of Tumour versus Normal.

# Load the example data:
Tissue_TvsN <- MetaProViz::ToyData(Data="Tissue_DMA")
Tissue_TvsN_Old <- MetaProViz::ToyData(Data="Tissue_DMA_Old")
Tissue_TvsN_Young <- MetaProViz::ToyData(Data="Tissue_DMA_Young")



Session information

#> R version 4.4.3 (2025-02-28)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] stringr_1.5.1    dplyr_1.1.4      purrr_1.0.4      rlang_1.1.5      magrittr_2.0.3   MetaProViz_2.1.7
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3            gridExtra_2.3        httr2_1.1.2          tcltk_4.4.3          logger_0.4.0        
#>   [6] readxl_1.4.5         compiler_4.4.3       RSQLite_2.3.9        systemfonts_1.2.1    vctrs_0.6.5         
#>  [11] rvest_1.0.4          pkgconfig_2.0.3      crayon_1.5.3         fastmap_1.2.0        backports_1.5.0     
#>  [16] labeling_0.4.3       rmarkdown_2.29       sessioninfo_1.2.3    tzdb_0.5.0           ggbeeswarm_0.7.2    
#>  [21] ragg_1.3.3           ggfortify_0.4.17     bit_4.6.0            xfun_0.51            cachem_1.1.0        
#>  [26] jsonlite_2.0.0       progress_1.2.3       blob_1.2.4           later_1.4.1          broom_1.0.8         
#>  [31] parallel_4.4.3       prettyunits_1.2.0    R6_2.6.1             bslib_0.9.0          stringi_1.8.7       
#>  [36] ComplexUpset_1.3.3   limma_3.63.11        car_3.1-3            lubridate_1.9.4      jquerylib_0.1.4     
#>  [41] cellranger_1.1.0     Rcpp_1.0.14          knitr_1.50           R.utils_2.13.0       readr_2.1.5         
#>  [46] igraph_2.1.4         timechange_0.3.0     tidyselect_1.2.1     rstudioapi_0.17.1    abind_1.4-8         
#>  [51] yaml_2.3.10          curl_6.2.2           tibble_3.2.1         withr_3.0.2          inflection_1.3.6    
#>  [56] evaluate_1.0.3       desc_1.4.3           zip_2.3.2            xml2_1.3.8           pillar_1.10.1       
#>  [61] ggpubr_0.6.0         carData_3.0-5        checkmate_2.3.2      generics_0.1.3       vroom_1.6.5         
#>  [66] hms_1.1.3            ggplot2_3.5.1        munsell_0.5.1        scales_1.3.0         gtools_3.9.5        
#>  [71] OmnipathR_3.15.12    glue_1.8.0           scatterplot3d_0.3-44 tools_4.4.3          ggsignif_0.6.4      
#>  [76] fs_1.6.5             XML_3.99-0.18        grid_4.4.3           tidyr_1.3.1          qcc_2.7             
#>  [81] colorspace_2.1-1     patchwork_1.3.0      beeswarm_0.4.0       vipor_0.4.7          Formula_1.2-5       
#>  [86] cli_3.6.4            rappdirs_0.3.3       Polychrome_1.5.1     kableExtra_1.4.0     textshaping_1.0.0   
#>  [91] viridisLite_0.4.2    svglite_2.1.3        gtable_0.3.6         R.methodsS3_1.8.2    rstatix_0.7.2       
#>  [96] hash_2.2.6.3         selectr_0.4-2        sass_0.4.9           digest_0.6.37        ggrepel_0.9.6       
#> [101] htmlwidgets_1.6.4    farver_2.1.2         memoise_2.0.1        htmltools_0.5.8.1    pkgdown_2.1.1       
#> [106] R.oo_1.27.0          factoextra_1.0.7     lifecycle_1.0.4      httr_1.4.7           statmod_1.5.0       
#> [111] bit64_4.6.0-1        MASS_7.3-64

Bibliography

Badia-I-Mompel, Pau, Jesús Vélez Santiago, Jana Braunger, Celina Geiss, Daniel Dimitrov, Sophia Müller-Dott, Petr Taus, et al. 2022. “decoupleR: Ensemble of Computational Methods to Infer Biological Activities from Omics Data.” Bioinformatics Advances 2 (1): vbac016. https://doi.org/10.1093/bioadv/vbac016.
Braisted, John, Andrew Patt, Cole Tindall, Timothy Sheils, Jorge Neyra, Kyle Spencer, Tara Eicher, and Ewy A Mathé. 2023. “RaMP-DB 2.0: A Renovated Knowledgebase for Deriving Biological and Chemical Insight from Metabolites, Proteins, and Genes.” Bioinformatics, no. 1 (January). https://doi.org/10.1093/bioinformatics/btac726.
Castanza, Anthony Scott, Jill Marie Recla, David Eby, Helga Thorvaldsdottir, Carol J. Bult, and Jill P Mesirov. 2022. “The Molecular Signatures Database Revisited: Extending Support for Mouse Data.” BioRxiv, October. https://doi.org/10.1101/2022.10.24.513539.
Dugourd, Aurelien, Christoph Kuppe, Marco Sciacovelli, Enio Gjerga, Attila Gabor, Kristina B Emdal, Vitor Vieira, et al. 2021. “Causal Integration of Multi-Omics Data with Prior Knowledge to Generate Mechanistic Hypotheses.” Molecular Systems Biology, no. 1 (January): e9730. https://doi.org/10.15252/msb.20209730.
Farr, Elias, Daniel Dimitrov, Christina Schmidt, Denes Turei, Sebastian Lobentanzer, Aurelien Dugourd, and Julio Saez-Rodriguez. 2024. “MetalinksDB: A Flexible and Contextualizable Resource of Metabolite-Protein Interactions.” Briefings in Bioinformatics, no. 4 (May). https://doi.org/10.1093/bib/bbae347.
Gaude, Edoardo, and Christian Frezza. 2016. “Tissue-Specific and Convergent Metabolic Transformation of Cancer Correlates with Metastatic Potential and Patient Survival.” Nature Communications 7 (October): 13041. https://doi.org/10.1038/ncomms13041.
Kanehisa, M, and S Goto. 2000. “KEGG: Kyoto Encyclopedia of Genes and Genomes.” Nucleic Acids Research 28 (1): 27–30. https://doi.org/10.1093/nar/28.1.27.
Liberzon, Arthur, Chet Birger, Helga Thorvaldsdóttir, Mahmoud Ghandi, Jill P Mesirov, and Pablo Tamayo. 2015. “The Molecular Signatures Database (MSigDB) Hallmark Gene Set Collection.” Cell Systems 1 (6): 417–25. https://doi.org/10.1016/j.cels.2015.12.004.
Sciacovelli, Marco, Aurelien Dugourd, Lorea Valcarcel Jimenez, Ming Yang, Efterpi Nikitopoulou, Ana S H Costa, Laura Tronci, et al. 2022. “Dynamic Partitioning of Branched-Chain Amino Acids-Derived Nitrogen Supports Renal Cancer Progression.” Nature Communications 13 (1): 7830. https://doi.org/10.1038/s41467-022-35036-4.