Translates a vector of identifiers, resulting in a new vector, or a column of identifiers in a data frame, by creating another column with the target identifiers.
Usage
translate_ids(
    d,
    ...,
    uploadlists = FALSE,
    ensembl = FALSE,
    keep_untranslated = TRUE,
    return_df = FALSE,
    organism = 9606,
    reviewed = TRUE
)
Arguments
- d
Character vector or data frame.
- ...
At least two arguments, with or without names. The first of these arguments describes the source identifier, the rest describe the target identifier(s). The values of all these arguments must be valid identifier types as shown in Details. The names of the arguments are column names. For the first (source) ID, the column must already exist. For the remaining IDs, new columns will be created with the given names. For ID types passed without names, the name of the ID type is used as the column name.
- uploadlists
Force using the uploadlists service from UniProt. By default the plain query interface is used (implemented in uniprot_full_id_mapping_table in this package). If any of the provided ID types is only available in the uploadlists service, it will be automatically selected. The plain query interface is preferred because in the long term, with caching, it requires less download and data storage.
- ensembl
Logical: use data from Ensembl BioMart instead of UniProt.
- keep_untranslated
If the output is a data frame, keep the records where the source identifier could not be translated. For these records the target identifier will be NA.
- return_df
Return a data frame even if the input is a vector.
- organism
Integer, NCBI Taxonomy ID of the organism (by default 9606 for human). Matters only if uploadlists is FALSE.
- reviewed
Translate only reviewed (TRUE), only unreviewed (FALSE) or both (NULL) UniProt records. Matters only if uploadlists is FALSE.
Value
Data frame: if the input is a data frame or the input is a vector and return_df is TRUE.
Vector: if the input is a vector, there is only one target ID type and return_df is FALSE.
List of vectors: if the input is a vector, there is more than one target ID type and return_df is FALSE. The names of the list will be ID types (as if they were column names, see the description of the ... argument), and the list will also include the source IDs.
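The three return types described above can be sketched with vector input as follows. This is a minimal illustration, assuming OmnipathR is installed and the UniProt services are reachable; the variable names are only for illustration, and the comments restate the documented behaviour rather than showing captured output.

```r
library(OmnipathR)

uniprots <- c('P00533', 'Q9ULV1', 'P43897', 'Q9Y2P5')

# One target ID type: a vector is expected.
genes <- translate_ids(uniprots, uniprot, genesymbol)

# Two target ID types: a list of vectors is expected,
# which also includes the source IDs.
ids <- translate_ids(uniprots, uniprot, genesymbol, entrez)

# return_df = TRUE: a data frame is expected despite the vector input.
df <- translate_ids(uniprots, uniprot, genesymbol, return_df = TRUE)
```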
Details
This function, depending on the uploadlists parameter, uses either the uploadlists service of UniProt or plain UniProt queries to obtain identifier translation tables. The possible values for from and to are the identifier type abbreviations used in the UniProt API; please refer to the table here: https://www.uniprot.org/help/api_idmapping. In addition, simple synonyms are available which provide a uniform API for the uploadlists and UniProt query based backends. These are the following:
OmnipathR | Uploadlists | UniProt query | Ensembl BioMart |
uniprot | ACC | id | uniprotswissprot |
uniprot_entry | ID | entry name | |
trembl | reviewed = FALSE | reviewed = FALSE | uniprotsptrembl |
genesymbol | GENENAME | genes(PREFERRED) | external_gene_name |
genesymbol_syn | | genes(ALTERNATIVE) | external_synonym |
hgnc | HGNC_ID | database(HGNC) | hgnc_symbol |
entrez | P_ENTREZGENEID | database(GeneID) | |
ensembl | ENSEMBL_ID | | ensembl_gene_id |
ensg | ENSEMBL_ID | | ensembl_gene_id |
enst | ENSEMBL_TRS_ID | database(Ensembl) | ensembl_transcript_id |
ensp | ENSEMBL_PRO_ID | | ensembl_peptide_id |
ensgg | ENSEMBLGENOME_ID | | |
ensgt | ENSEMBLGENOME_TRS_ID | | |
ensgp | ENSEMBLGENOME_PRO_ID | | |
protein_name | | protein names | |
pir | PIR | database(PIR) | |
ccds | | database(CCDS) | |
refseqp | P_REFSEQ_AC | database(refseq) | |
ipro | | | interpro |
ipro_desc | | | interpro_description |
ipro_sdesc | | | interpro_short_description |
wikigene | | | wikigene_name |
rnacentral | | | rnacentral |
gene_desc | | | description |
wormbase | | database(WormBase) | |
flybase | | database(FlyBase) | |
xenbase | | database(Xenbase) | |
zfin | | database(ZFIN) | |
pbd | PBD_ID | database(PDB) | pbd |
The mapping between identifiers can be ambiguous. In this case one row in the original data frame yields multiple rows or elements in the returned data frame or vector(s).
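The effect of ambiguous mappings, and of keep_untranslated, can be pictured as a join against a translation table. The sketch below uses base R only, with a made-up translation table and hypothetical identifiers; it is an analogy for the behaviour described above, not the implementation used in this package.

```r
# Toy translation table with hypothetical IDs:
# source ID 'A1' maps to two target IDs.
id_table <- data.frame(
    source_id = c('A1', 'A1', 'B2'),
    target_id = c('GENE1', 'GENE1B', 'GENE2'),
    stringsAsFactors = FALSE
)

# Input with one ambiguous, one unambiguous and one unknown ID.
d <- data.frame(
    source_id = c('A1', 'B2', 'C3'),
    stringsAsFactors = FALSE
)

# Left join: the ambiguous ID expands to several rows and the
# untranslated ID is kept with NA, comparable to keep_untranslated = TRUE.
kept <- merge(d, id_table, by = 'source_id', all.x = TRUE)

# Inner join: the untranslated ID is dropped, comparable to
# keep_untranslated = FALSE.
dropped <- merge(d, id_table, by = 'source_id')

nrow(d)        # 3 input rows
nrow(kept)     # 4 rows: 'A1' twice, 'B2' once, 'C3' with NA
nrow(dropped)  # 3 rows: 'A1' twice, 'B2' once
```

Because one input row can become several output rows, the result should not be assumed to align row by row with the input.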
Examples
d <- data.frame(uniprot_id = c('P00533', 'Q9ULV1', 'P43897', 'Q9Y2P5'))
d <- translate_ids(d, uniprot_id = uniprot, genesymbol)
d
#>   uniprot_id genesymbol
#> 1     P00533       EGFR
#> 2     Q9ULV1       FZD4
#> 3     P43897       TSFM
#> 4     Q9Y2P5    SLC27A5