Advanced ID Translation¶
This page covers the internal workings of the translation system for power users and developers. For the basic API reference, see ID Translation.
Controlling translation behavior¶
Raw mode: bypass all special handling¶
Every translation function accepts raw=True. In raw mode, the
translator performs a single direct table lookup -- no case fallbacks,
no chain translation, no UniProt cleanup. This is useful when you need
maximum speed and know your identifiers are already in the exact form
stored in the mapping table.
from omnipath_utils.mapping import map_name, translate
# Raw: exact match only
map_name('TP53', 'genesymbol', 'uniprot', raw=True)
# {'P04637'}
map_name('tp53', 'genesymbol', 'uniprot', raw=True)
# set() -- lowercase, no fallback
# Batch raw translation
translate(
['TP53', 'EGFR', 'tp53'],
'genesymbol', 'uniprot',
raw=True,
)
# {'TP53': {'P04637'}, 'EGFR': {'P00533'}, 'tp53': set()}
curl "https://omnipathdb.org/mapping/translate?\
identifiers=TP53,tp53&\
id_type=genesymbol&\
target_id_type=uniprot&\
raw=true"
The response meta object includes "raw": true to confirm the
parameter was applied.
from omnipath_utils.mapping import translate_column
result = translate_column(
df, 'gene', 'genesymbol', 'uniprot',
raw=True,
)
What raw mode skips:
| Feature | Normal mode | Raw mode |
|---|---|---|
| Gene symbol case fallbacks (UPPER, Capitalize) | Yes | No |
| Gene symbol synonym lookup | Yes | No |
| Chain translation (source -> uniprot -> target) | Yes | No |
| UniProt cleanup (secondary -> primary, TrEMBL -> SwissProt) | Yes | No |
| RefSeq version stripping | Yes | No |
| Ensembl version stripping | Yes | No |
| miRNA reciprocal fallback | Yes | No |
| CURIE prefix stripping | Yes | No |
| Reverse table scan | Yes | No |
Explicit backend selection¶
By default, the mapper auto-selects a backend based on which data
sources support the requested ID type pair. You can force a specific
backend with the backend= parameter.
from omnipath_utils.mapping import map_name, translate
# Force BioMart for Ensembl-centric lookups
map_name('TP53', 'genesymbol', 'ensg', backend='biomart')
# Force UniProt REST for protein IDs
translate(
['P04637', 'P00533'],
'uniprot', 'genesymbol',
backend='uniprot',
)
curl "https://omnipathdb.org/mapping/translate?\
identifiers=TP53&\
id_type=genesymbol&\
target_id_type=uniprot&\
backend=biomart"
Note
The REST API uses the database backend for lookups. The backend
parameter is recorded in the response metadata but does not
change the query behavior in database mode. It is primarily
meaningful for the Python API.
Available backends:
| Backend | Data source | When to force it |
|---|---|---|
uniprot |
UniProt REST API | Default for protein IDs; most comprehensive protein cross-refs |
uniprot_ftp |
UniProt FTP idmapping files | Bulk download; faster for large-scale builds |
uploadlists |
UniProt ID Mapping batch service | Specific, targeted ID sets |
biomart |
Ensembl BioMart | When you need fresh Ensembl data or Ensembl-specific ID types |
mirbase |
miRBase | miRNA names and accessions |
unichem |
UniChem (EMBL-EBI) | Chemical compound cross-references |
ramp |
RaMP-DB | Metabolite cross-references and synonym mappings |
hmdb |
HMDB | HMDB-centric metabolite mappings |
metanetx |
MNXref | Pairwise metabolite ID translation via MetaNetX bridge (bigg↔chebi, kegg↔chebi, hmdb↔chebi, lipidmaps↔chebi, swisslipids↔chebi, metanetx↔*) |
bigg |
BiGG Models | BiGG metabolite mappings (bigg↔chebi, bigg↔hmdb, bigg↔kegg, bigg↔metanetx) |
When backend= is specified, the mapper skips its cached table and
forces a reload from the requested backend. This is useful when the
auto-selected backend returned incomplete data and you want to try a
different source.
Querying SwissProt, TrEMBL, and synonyms directly¶
The uniprot target type runs the full cleanup pipeline (secondary AC
resolution, TrEMBL-to-SwissProt preference, proteome filtering, format
validation). To bypass this and query specific subsets of UniProt, use
the dedicated target types:
from omnipath_utils.mapping import map_name
# Default: prefers SwissProt, runs full cleanup
map_name('TP53', 'genesymbol', 'uniprot')
# {'P04637'} -- SwissProt entry
# Only reviewed (SwissProt) entries, no cleanup
map_name('TP53', 'genesymbol', 'swissprot')
# {'P04637'}
# Only unreviewed (TrEMBL) entries, no cleanup
map_name('TP53', 'genesymbol', 'trembl')
# TrEMBL entries for TP53, if any exist
# Gene symbol synonyms
map_name('p53', 'genesymbol-syn', 'uniprot')
# Looks up 'p53' as a synonym; may find TP53's UniProt AC
Target type behavior:
| Target type | Source data | Cleanup pipeline | Typical use |
|---|---|---|---|
uniprot |
SwissProt + TrEMBL | Yes (full 4-step) | Default; production use |
swissprot |
SwissProt only | No | When you specifically need reviewed entries |
trembl |
TrEMBL only | No | When you need unreviewed entries |
genesymbol-syn |
Gene symbol synonym table | No (source type) | Looking up old/alternative gene names |
The translation pipeline in detail¶
Step-by-step walkthrough¶
Here is what happens when you call:
map_name('tp53', 'genesymbol', 'uniprot')
Step 1: Alias resolution.
IdTypeRegistry.resolve() normalizes the type names. 'genesymbol'
is already canonical. Variants like 'GeneSymbol', 'gene_symbol',
or 'genesymbol_syn' would be resolved to their canonical forms.
Step 2: Direct lookup.
The mapper looks up 'tp53' in the genesymbol -> uniprot table.
Tables are case-sensitive, so 'tp53' is not found (the table has
'TP53'). Result: miss.
Step 3: Gene symbol fallbacks.
Since id_type is 'genesymbol', the fallback chain runs:
'tp53'.upper()='TP53'-- lookup finds{'P04637', 'A0A024R1R8', ...}. Match found; remaining fallbacks are skipped.
If UPPER had failed, the system would try:
'tp53'.capitalize()='Tp53'(for rodent symbols)- Synonym table lookup for
'tp53'and'TP53' - Append "1":
'tp531'(non-strict mode only)
Step 4: UniProt cleanup.
Since the target is 'uniprot', the cleanup pipeline runs on the
raw result set {'P04637', 'A0A024R1R8', ...}:
-
Secondary -> primary: Each AC is checked against the
uniprot-sec -> uniprot-pritable. If it is a secondary AC, it is replaced with the current primary AC. -
TrEMBL -> SwissProt:
A0A024R1R8is not in the SwissProt reference list. The pipeline looks up its gene symbol via thetrembl -> genesymboltable, gets'TP53', then finds the SwissProt entry for'TP53':'P04637'. The TrEMBL AC is replaced. -
Proteome filter: The result set is intersected with the human proteome reference list.
P04637is present; stale or cross-organism ACs are removed. -
Format validation: Each AC is checked against the UniProt AC regex. Invalid strings are discarded.
Step 5: Result.
{'P04637'} -- just the SwissProt entry.
Vectorization¶
The translate() and translate_column() functions use
translate_core() internally, which implements a two-pass strategy:
-
First pass (batch): All identifiers are looked up in the mapping table in a single sweep. This is an O(1) dict lookup per ID. Identifiers that get a direct hit are collected immediately.
-
Second pass (per-ID fallback): Only identifiers that missed in the first pass go through the full
map_name()pipeline with all its fallback strategies.
This means: if your mapping table covers 99% of your identifiers, only 1% go through the slower per-ID fallback path. For typical gene symbol-to-UniProt translations with properly cased input, the first pass handles nearly everything.
# Efficient: translate_core handles the batch
from omnipath_utils.mapping import translate
result = translate(gene_list, 'genesymbol', 'uniprot')
# Less efficient: each call goes through the full pipeline
from omnipath_utils.mapping import map_name
result = {g: map_name(g, 'genesymbol', 'uniprot') for g in gene_list}
When the target type is uniprot, cleanup runs on the batch results
from the first pass as well (not just the fallback results), ensuring
consistent output regardless of which pass produced the hit.
Memory mode vs Database mode¶
The Python API uses memory mode: mapping tables are downloaded from
upstream APIs (UniProt, BioMart, etc.), stored as dict[str, set[str]]
in memory, and cached as pickle files on disk. This is fast for
interactive use and small to medium batch jobs.
The REST API uses database mode: mapping tables are pre-loaded into PostgreSQL during a build step. Translation queries run as SQL queries against indexed tables. This handles concurrent requests and very large ID sets efficiently.
Both modes share the same fallback logic and cleanup pipeline. The difference is the data access layer:
| Aspect | Memory mode | Database mode |
|---|---|---|
| Data loading | On-demand from upstream APIs | Pre-built into PostgreSQL |
| Lookup | Python dict (O(1)) | SQL query (indexed) |
| Fallbacks | map_name() per-ID pipeline |
Batch SQL queries per fallback step |
| Cleanup | Same pipeline | Same pipeline |
| Caching | Pickle files, 5-min memory expiry | Database is the cache |
| Concurrency | Single-process | Multi-process via connection pool |
REST API fallback chain¶
The REST API implements the same fallback strategies as the Python API, but uses batch SQL queries instead of per-ID lookups.
How REST fallbacks work¶
-
Direct DB query: All input IDs are looked up in a single SQL query against the
id_mappingtable. -
Gene symbol fallbacks (batch): For unmapped IDs where the source type is
genesymbol:- Uppercase batch: all unmapped IDs are uppercased and queried
- Capitalize batch: remaining unmapped IDs are capitalized and queried
- Synonym batch: remaining unmapped IDs are queried against
genesymbol-syn - Uppercase synonym batch: remaining unmapped IDs uppercased and
queried against
genesymbol-syn
-
Chain translation (batch): For unmapped IDs where neither side is
uniprot: a batch intermediate query (source -> uniprot), then a batch final query (uniprot -> target). -
UniProt cleanup: Same pipeline as the Python API (secondary -> primary, TrEMBL -> SwissProt, proteome filter, format validation).
When raw=true is passed, steps 2-4 are skipped entirely.
REST query parameters reference¶
| Parameter | Type | Default | Description |
|---|---|---|---|
identifiers |
string | required | Comma-separated IDs (GET) or JSON list (POST) |
id_type |
string | required | Source ID type |
target_id_type |
string | required | Target ID type |
ncbi_tax_id |
int | 9606 | Organism NCBI Taxonomy ID |
raw |
bool | false | Skip all special-case handling |
backend |
string | null | Force specific backend (metadata only in DB mode) |
GET examples:
# Basic translation
curl "https://omnipathdb.org/mapping/translate?\
identifiers=TP53,EGFR&\
id_type=genesymbol&\
target_id_type=uniprot"
# Raw mode -- direct lookup only
curl "https://omnipathdb.org/mapping/translate?\
identifiers=TP53&\
id_type=genesymbol&\
target_id_type=uniprot&\
raw=true"
# Specify organism (mouse)
curl "https://omnipathdb.org/mapping/translate?\
identifiers=Trp53&\
id_type=genesymbol&\
target_id_type=uniprot&\
ncbi_tax_id=10090"
POST example:
curl -X POST "https://omnipathdb.org/mapping/translate" \
-H "Content-Type: application/json" \
-d '{
"identifiers": ["TP53", "EGFR", "BRCA1"],
"id_type": "genesymbol",
"target_id_type": "uniprot",
"raw": false,
"backend": null
}'
Backend details¶
Auto-selection algorithm¶
When no backend is specified, the mapper's _find_backends() method
builds a candidate list:
-
Column-based backends (
uniprot,uniprot_ftp,biomart): For each backend, it checks whether both the source and target ID types have a column defined inid_types.yamlunder that backend's key. If both columns exist, the backend is added to the candidate list. -
Custom backends (
mirbase,unichem,ramp,hmdb,metanetx,bigg): These are always appended to the candidate list. They perform their own internal support check in theirread()method and return an empty dict if the ID type pair is not supported.
The mapper tries each candidate in order. The first backend that returns non-empty data wins.
Backend-specific behavior¶
uniprot¶
- Data: UniProt AC, gene symbols, Entrez, HGNC, RefSeq, PDB, and all cross-references stored in UniProt entries.
- Access:
pypath.inputs.uniprot(preferred) or direct HTTP to the UniProt REST API. - Organism-specific: Yes. Queries are filtered by NCBI taxonomy ID.
- Caching: Pickle file per
(source, target, organism)triple. - When to force: Default for most protein-related translations.
uniprot_ftp¶
- Data: Same scope as
uniprot, but downloaded from UniProt FTP idmapping files (bulk, pre-computed). - Access:
pypath.inputs.uniprot_ftpor direct HTTP to UniProt FTP. - Organism-specific: Yes, but only 12 model organisms have pre-computed files.
- Caching: Same as
uniprot. - When to force: Faster for full-proteome builds; useful when the REST API is slow or rate-limited.
uploadlists¶
- Data: Same scope as UniProt REST.
- Access: Direct HTTP to the UniProt ID Mapping batch service (submit job, poll, collect results).
- Organism-specific: Yes.
- When to force: When you have a specific, bounded set of IDs and want the most up-to-date cross-references.
biomart¶
- Data: Ensembl gene/transcript/protein IDs, gene symbols, Entrez.
- Access:
pypath.inputs.biomartor direct HTTP to the Ensembl BioMart XML service. - Organism-specific: Yes.
- Caching: Pickle file per triple.
- When to force: When translating Ensembl IDs or when you need fresh Ensembl-specific data.
mirbase¶
- Data: Precursor names (
mir-pre), mature names (mir-mat-name), miRBase accessions (mirbase). - Access:
pypath.inputs.mirbase(requires pypath). - Organism-specific: Yes.
- When to force: miRNA translations.
unichem¶
- Data: Cross-references between chemical databases (ChEMBL, ChEBI, DrugBank, PubChem, KEGG, etc.).
- Access:
pypath.inputs.unichem(requires pypath). - Organism-specific: No (chemicals are universal).
- When to force: Chemical compound ID translation.
ramp¶
- Data: Metabolite cross-references and synonym mappings from RaMP-DB.
- Access:
pypath.inputs.ramp(requires pypath). - Organism-specific: No.
- When to force: Metabolite ID translation, especially when you need common-name-to-database-ID resolution.
hmdb¶
- Data: HMDB, PubChem, ChEBI, DrugBank, KEGG compound.
- Access:
pypath.inputs.hmdb(requires pypath). - Organism-specific: Human-derived data, but identifiers are universal.
- When to force: HMDB-centric metabolite lookups.
metanetx¶
- Data: Pairwise metabolite ID cross-references from MNXref
chem_xref.tsv(3.4M cross-reference entries). - Access:
pypath.inputs.metanetx.metanetx_mapping()(requires pypath). - Organism-specific: No (chemicals are universal).
- Supported pairs: bigg↔chebi, kegg↔chebi, hmdb↔chebi, lipidmaps↔chebi, swisslipids↔chebi, and all metanetx↔* combinations.
- Coverage: 82K hmdb→chebi, 45K kegg→chebi, 23K lipidmaps→chebi, 11K bigg→chebi via MetaNetX bridge.
- When to force: Metabolite ID translation, especially for lipidmaps→chebi and swisslipids→chebi which are not available in other backends.
bigg¶
- Data: BiGG Models universal metabolite TSV (9,090 universal metabolites across 85+ models).
- Access:
pypath.inputs.bigg.bigg_metabolite_mapping()(requires pypath). - Organism-specific: No (chemicals are universal).
- Supported pairs: bigg↔chebi, bigg↔hmdb, bigg↔kegg, bigg↔metanetx.
- Coverage: 2,145 BiGG metabolites with ChEBI (10,319 pairs including ChEBI ontology hierarchy).
- When to force: BiGG metabolite ID translation. Combined with the MetaNetX backend, gives maximum BiGG→ChEBI coverage.
Writing a new backend¶
To add a new data source:
-
Create a module in
omnipath_utils/mapping/backends/(e.g._mybackend.py). -
Subclass
MappingBackend:from omnipath_utils.mapping.backends._base import MappingBackend from omnipath_utils.mapping.backends import register class MyBackend(MappingBackend): name = "mybackend" yaml_key = "mybackend" # key in id_types.yaml def _read_via_pypath( self, id_type, target_id_type, ncbi_tax_id, *, src_col, tgt_col, **kwargs, ): import pypath.inputs.mybackend as mymod # ... fetch and return dict[str, set[str]] raise ImportError # if pypath not available def _read_direct( self, id_type, target_id_type, ncbi_tax_id, *, src_col, tgt_col, **kwargs, ): # Direct HTTP implementation # Return dict[str, set[str]] return {} register("mybackend", MyBackend) -
Add column definitions to
id_types.yamlunder themybackendkey for each supported ID type. -
The backend will be automatically discovered by
_find_backends()if it is a column-based backend, or by the custom backends list if you add it to_CUSTOM_BACKENDSinMapper.
HMDB identifier normalisation¶
HMDB identifiers have two historical formats: the old 5-digit format
(HMDB00001) and the current 7-digit format (HMDB0000001). All
translation APIs (Python and REST) automatically normalise the old format
to the 7-digit form. This normalisation is applied transparently at input
time -- both formats are accepted, and results always use the 7-digit
format.
The normalisation is applied by the mapper before any backend lookup, so
it works consistently across all backends that handle HMDB identifiers
(hmdb, metanetx, bigg, ramp, unichem).
Database tables for special cases¶
During a database build, several auxiliary tables are created beyond
the main id_mapping table:
| Table content | Source -> Target | Purpose |
|---|---|---|
| Gene symbol -> SwissProt | genesymbol -> swissprot |
TrEMBL-to-SwissProt cleanup step |
| TrEMBL -> gene symbol | trembl -> genesymbol |
Reverse lookup in TrEMBL-to-SwissProt cleanup |
| Gene symbol synonyms -> UniProt | genesymbol-syn -> uniprot |
Synonym fallback |
| SwissProt reference list | reflist | Proteome membership check (SwissProt) |
| TrEMBL reference list | reflist | Proteome membership check (TrEMBL) |
| HGNC IDs | hgnc -> uniprot |
Additional protein ID coverage |
| RefSeq protein IDs | refseqp -> uniprot |
Additional protein ID coverage |
| Secondary -> primary UniProt | uniprot-sec -> uniprot-pri |
Obsolete AC resolution |
These tables enable the cleanup pipeline and fallback strategies to run entirely within the database, without loading data into memory.
Caching and performance¶
Memory mode caching¶
Mapping tables are cached as pickle files in
~/.cache/omnipath_utils/mapping/. Each unique combination of
(id_type, target_id_type, ncbi_tax_id, backend) produces a
deterministic filename via MD5 hash:
mapping_genesymbol__uniprot__9606__a1b2c3d4e5f6.pickle
In-memory tables auto-expire after 5 minutes (300 seconds) of
inactivity. The lifetime parameter on the Mapper constructor
controls this. Expired tables are removed on the next
remove_expired() call.
To clear the cache:
rm -rf ~/.cache/omnipath_utils/mapping/
To change the cache directory:
from omnipath_utils.mapping._mapper import Mapper
mapper = Mapper(cachedir='/custom/cache/path')
Database mode performance¶
The database schema uses indexes on (source_type_id, target_type_id,
ncbi_tax_id, source_id) for fast lookups. The COPY command is used
during builds for fast bulk inserts. Partitioning by organism is
available for very large deployments.
Translation performance tips¶
Use batch functions for multiple IDs. translate() and
translate_column() use translate_core() which performs a vectorized
first pass. Calling map_name() in a loop forces every ID through the
full fallback pipeline.
# Good: vectorized first pass, fallback only for misses
result = translate(gene_list, 'genesymbol', 'uniprot')
# Bad: every ID goes through full pipeline
result = {g: map_name(g, 'genesymbol', 'uniprot') for g in gene_list}
Use raw=True when you do not need fallbacks. If your identifiers
are already in canonical form (e.g. uppercase gene symbols, primary
UniProt ACs), raw mode avoids all overhead:
result = translate(clean_gene_list, 'genesymbol', 'uniprot', raw=True)
Pre-fetch the translation table for repeated lookups. If you are translating IDs across multiple DataFrames or in a loop, fetch the table once and reuse it:
from omnipath_utils.mapping import translation_table
table = translation_table('genesymbol', 'uniprot')
# Use table directly as a dict
for gene in genes:
uniprots = table.get(gene, set())
For DataFrames, use translate_column(). It handles deduplication,
batch lookup, and optional row expansion in a single call:
from omnipath_utils.mapping import translate_column
df = translate_column(df, 'gene', 'genesymbol', 'uniprot')
Identify and All-Mappings endpoints¶
GET /mapping/identify¶
Given one or more identifiers, search all mapping tables to find which ID types contain them. Useful when the type of an identifier is unknown.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
identifiers |
string | yes | Comma-separated identifiers |
ncbi_tax_id |
int | no | NCBI Taxonomy ID (default: 9606) |
curl "https://omnipathdb.org/mapping/identify?\
identifiers=P04637,HMDB0000001"
Response:
{
"results": {
"P04637": [
{"id_type": "uniprot", "role": "source", "mappings_count": 5},
{"id_type": "uniprot", "role": "target", "mappings_count": 1}
],
"HMDB0000001": [
{"id_type": "hmdb", "role": "source", "mappings_count": 3}
]
},
"meta": {
"ncbi_tax_id": 9606,
"total_input": 2
}
}
Each match includes:
- id_type -- the canonical ID type name where the identifier was found.
- role --
"source"or"target", indicating whether the identifier appears as a source or target in mapping rows. - mappings_count -- the number of distinct mapped partners.
GET /mapping/all¶
Given identifiers and their type, return all known mappings to every other target type in a single request.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
identifiers |
string | yes | Comma-separated identifiers |
id_type |
string | yes | Source ID type |
ncbi_tax_id |
int | no | NCBI Taxonomy ID (default: 9606) |
curl "https://omnipathdb.org/mapping/all?\
identifiers=P04637&\
id_type=uniprot"
Response:
{
"results": {
"P04637": {
"genesymbol": ["TP53"],
"entrez": ["7157"],
"ensg": ["ENSG00000141510"],
"hgnc": ["11998"]
}
},
"meta": {
"id_type": "uniprot",
"ncbi_tax_id": 9606,
"total_input": 1
}
}
Python API¶
Both functions are available from omnipath_utils.mapping:
from omnipath_utils.mapping import identify, all_mappings
# Identify unknown identifiers
identify(["P04637", "HMDB0000001"])
# Get all mappings
all_mappings(["P04637"], "uniprot")
And from the client:
from omnipath_client.utils import identify, all_mappings
identify(["P04637", "HMDB0000001"])
all_mappings(["P04637"], "uniprot")
Both require database mode (PostgreSQL) on the server side.