Advanced ID Translation¶

This page covers the internal workings of the translation system for power users and developers. For the basic API reference, see ID Translation.

Controlling translation behavior¶

Raw mode: bypass all special handling¶

Every translation function accepts raw=True. In raw mode, the translator performs a single direct table lookup -- no case fallbacks, no chain translation, no UniProt cleanup. This is useful when you need maximum speed and know your identifiers are already in the exact form stored in the mapping table.

Examples for the Python API, the REST API, and DataFrames follow in that order.

from omnipath_utils.mapping import map_name, translate

# Raw: exact match only
map_name('TP53', 'genesymbol', 'uniprot', raw=True)
# {'P04637'}

map_name('tp53', 'genesymbol', 'uniprot', raw=True)
# set()  -- lowercase, no fallback

# Batch raw translation
translate(
    ['TP53', 'EGFR', 'tp53'],
    'genesymbol', 'uniprot',
    raw=True,
)
# {'TP53': {'P04637'}, 'EGFR': {'P00533'}, 'tp53': set()}

curl "https://omnipathdb.org/mapping/translate?\
identifiers=TP53,tp53&\
id_type=genesymbol&\
target_id_type=uniprot&\
raw=true"

The response meta object includes "raw": true to confirm the parameter was applied.

from omnipath_utils.mapping import translate_column

result = translate_column(
    df, 'gene', 'genesymbol', 'uniprot',
    raw=True,
)

What raw mode skips:

Feature	Normal mode	Raw mode
Gene symbol case fallbacks (UPPER, Capitalize)	Yes	No
Gene symbol synonym lookup	Yes	No
Chain translation (source -> uniprot -> target)	Yes	No
UniProt cleanup (secondary -> primary, TrEMBL -> SwissProt)	Yes	No
RefSeq version stripping	Yes	No
Ensembl version stripping	Yes	No
miRNA reciprocal fallback	Yes	No
CURIE prefix stripping	Yes	No
Reverse table scan	Yes	No

Explicit backend selection¶

By default, the mapper auto-selects a backend based on which data sources support the requested ID type pair. You can force a specific backend with the backend= parameter.

Examples for the Python API and the REST API follow in that order.

from omnipath_utils.mapping import map_name, translate

# Force BioMart for Ensembl-centric lookups
map_name('TP53', 'genesymbol', 'ensg', backend='biomart')

# Force UniProt REST for protein IDs
translate(
    ['P04637', 'P00533'],
    'uniprot', 'genesymbol',
    backend='uniprot',
)

curl "https://omnipathdb.org/mapping/translate?\
identifiers=TP53&\
id_type=genesymbol&\
target_id_type=uniprot&\
backend=biomart"

Note

The REST API uses the database backend for lookups. The backend parameter is recorded in the response metadata but does not change the query behavior in database mode. It is primarily meaningful for the Python API.

Available backends:

Backend	Data source	When to force it
`uniprot`	UniProt REST API	Default for protein IDs; most comprehensive protein cross-refs
`uniprot_ftp`	UniProt FTP idmapping files	Bulk download; faster for large-scale builds
`uploadlists`	UniProt ID Mapping batch service	Specific, targeted ID sets
`biomart`	Ensembl BioMart	When you need fresh Ensembl data or Ensembl-specific ID types
`mirbase`	miRBase	miRNA names and accessions
`unichem`	UniChem (EMBL-EBI)	Chemical compound cross-references
`ramp`	RaMP-DB	Metabolite cross-references and synonym mappings
`hmdb`	HMDB	HMDB-centric metabolite mappings
`metanetx`	MNXref	Pairwise metabolite ID translation via MetaNetX bridge (bigg↔chebi, kegg↔chebi, hmdb↔chebi, lipidmaps↔chebi, swisslipids↔chebi, metanetx↔*)
`bigg`	BiGG Models	BiGG metabolite mappings (bigg↔chebi, bigg↔hmdb, bigg↔kegg, bigg↔metanetx)

When backend= is specified, the mapper skips its cached table and forces a reload from the requested backend. This is useful when the auto-selected backend returned incomplete data and you want to try a different source.

Querying SwissProt, TrEMBL, and synonyms directly¶

The uniprot target type runs the full cleanup pipeline (secondary AC resolution, TrEMBL-to-SwissProt preference, proteome filtering, format validation). To bypass this and query specific subsets of UniProt, use the dedicated swissprot and trembl target types. Gene symbol synonyms are queried through the genesymbol-syn source type:

from omnipath_utils.mapping import map_name

# Default: prefers SwissProt, runs full cleanup
map_name('TP53', 'genesymbol', 'uniprot')
# {'P04637'}  -- SwissProt entry

# Only reviewed (SwissProt) entries, no cleanup
map_name('TP53', 'genesymbol', 'swissprot')
# {'P04637'}

# Only unreviewed (TrEMBL) entries, no cleanup
map_name('TP53', 'genesymbol', 'trembl')
# TrEMBL entries for TP53, if any exist

# Gene symbol synonyms
map_name('p53', 'genesymbol-syn', 'uniprot')
# Looks up 'p53' as a synonym; may find TP53's UniProt AC

Target type behavior:

Target type	Source data	Cleanup pipeline	Typical use
`uniprot`	SwissProt + TrEMBL	Yes (full 4-step)	Default; production use
`swissprot`	SwissProt only	No	When you specifically need reviewed entries
`trembl`	TrEMBL only	No	When you need unreviewed entries
`genesymbol-syn`	Gene symbol synonym table	No (source type)	Looking up old/alternative gene names

The translation pipeline in detail¶

Step-by-step walkthrough¶

Here is what happens when you call:

map_name('tp53', 'genesymbol', 'uniprot')

Step 1: Alias resolution. IdTypeRegistry.resolve() normalizes the type names. 'genesymbol' is already canonical. Variants like 'GeneSymbol', 'gene_symbol', or 'genesymbol_syn' would be resolved to their canonical forms.

Step 2: Direct lookup. The mapper looks up 'tp53' in the genesymbol -> uniprot table. Tables are case-sensitive, so 'tp53' is not found (the table has 'TP53'). Result: miss.

Step 3: Gene symbol fallbacks. Since id_type is 'genesymbol', the fallback chain runs:

'tp53'.upper() = 'TP53' -- lookup finds {'P04637', 'A0A024R1R8', ...}. Match found; remaining fallbacks are skipped.

If UPPER had failed, the system would try:

'tp53'.capitalize() = 'Tp53' (for rodent symbols)
Synonym table lookup for 'tp53' and 'TP53'
Append "1": 'tp531' (non-strict mode only)
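The fallback order in Step 3 can be sketched with plain dictionaries. This is an illustration only, not the library's code: TABLE, SYNONYMS, and lookup_with_fallbacks are hypothetical names, and the toy entries stand in for the real genesymbol -> uniprot and synonym tables.

```python
# Toy stand-ins for the real mapping tables (hypothetical data).
TABLE = {
    'TP53': {'P04637', 'A0A024R1R8'},
    'Trp53': {'P02340'},  # rodent-style capitalized symbol
}
SYNONYMS = {'P53': {'P04637'}}


def lookup_with_fallbacks(name, strict=False):
    """Try each gene symbol variant in order; stop at the first hit."""
    # Direct lookup, then UPPER, then Capitalize.
    for variant in (name, name.upper(), name.capitalize()):
        hit = TABLE.get(variant)
        if hit:
            return set(hit)
    # Synonym table, with the original and the uppercased form.
    for variant in (name, name.upper()):
        hit = SYNONYMS.get(variant)
        if hit:
            return set(hit)
    # Last resort, non-strict mode only: append "1".
    if not strict:
        return set(TABLE.get(name + '1', set()))
    return set()
```

Because every branch returns on the first hit, later and more speculative variants are only tried when the earlier ones miss.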

Step 4: UniProt cleanup. Since the target is 'uniprot', the cleanup pipeline runs on the raw result set {'P04637', 'A0A024R1R8', ...}:

1. Secondary -> primary: Each AC is checked against the uniprot-sec -> uniprot-pri table. If it is a secondary AC, it is replaced with the current primary AC.
2. TrEMBL -> SwissProt: A0A024R1R8 is not in the SwissProt reference list. The pipeline looks up its gene symbol via the trembl -> genesymbol table, gets 'TP53', then finds the SwissProt entry for 'TP53': 'P04637'. The TrEMBL AC is replaced.
3. Proteome filter: The result set is intersected with the human proteome reference list. P04637 is present; stale or cross-organism ACs are removed.
4. Format validation: Each AC is checked against the UniProt AC regex. Invalid strings are discarded.

Step 5: Result. {'P04637'} -- just the SwissProt entry.
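The cleanup in Step 4 can be sketched as four set operations. This is a minimal sketch under stated assumptions, not the library's implementation: all table names and the cleanup function are hypothetical, the reference data is illustrative toy data, and keeping a TrEMBL AC that has no SwissProt counterpart is an assumption. The regular expression is the accession pattern published by UniProt.

```python
import re

# Accession number pattern as published by UniProt.
UNIPROT_AC = re.compile(
    r'^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]'
    r'|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$'
)

# Illustrative toy reference data (hypothetical stand-ins for the real tables).
SEC_TO_PRI = {'Q15086': 'P04637'}
TREMBL_TO_GENESYMBOL = {'A0A024R1R8': 'TP53'}
GENESYMBOL_TO_SWISSPROT = {'TP53': {'P04637'}}
SWISSPROT = {'P04637'}
PROTEOME = {'P04637'}


def cleanup(acs):
    """Apply the four cleanup steps to a raw set of UniProt ACs."""
    # 1. Secondary -> primary.
    acs = {SEC_TO_PRI.get(ac, ac) for ac in acs}
    # 2. TrEMBL -> SwissProt via the gene symbol.
    resolved = set()
    for ac in acs:
        if ac in SWISSPROT:
            resolved.add(ac)
            continue
        symbol = TREMBL_TO_GENESYMBOL.get(ac)
        swissprot = GENESYMBOL_TO_SWISSPROT.get(symbol, set())
        # Assumption: keep the original AC when no SwissProt entry is found.
        resolved |= swissprot or {ac}
    # 3. Proteome filter.
    resolved &= PROTEOME
    # 4. Format validation.
    return {ac for ac in resolved if UNIPROT_AC.match(ac)}
```

With this toy data, cleanup({'P04637', 'A0A024R1R8'}) collapses to the single SwissProt entry, matching the walkthrough.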

Vectorization¶

The translate() and translate_column() functions use translate_core() internally, which implements a two-pass strategy:

First pass (batch): All identifiers are looked up in the mapping table in a single sweep. This is an O(1) dict lookup per ID. Identifiers that get a direct hit are collected immediately.
Second pass (per-ID fallback): Only identifiers that missed in the first pass go through the full map_name() pipeline with all its fallback strategies.

As a result, if the mapping table covers 99% of your identifiers, only the remaining 1% go through the slower per-ID fallback path. For typical gene symbol-to-UniProt translations with properly cased input, the first pass handles nearly everything.

# Efficient: translate_core handles the batch
from omnipath_utils.mapping import translate
result = translate(gene_list, 'genesymbol', 'uniprot')

# Less efficient: each call goes through the full pipeline
from omnipath_utils.mapping import map_name
result = {g: map_name(g, 'genesymbol', 'uniprot') for g in gene_list}

When the target type is uniprot, cleanup runs on the batch results from the first pass as well (not just the fallback results), ensuring consistent output regardless of which pass produced the hit.
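The two-pass strategy can be illustrated with a short sketch. It is not the code of translate_core(): the function translate_two_pass, the toy table, and the fallback callable are all hypothetical, and the UniProt cleanup step is left out for brevity.

```python
def translate_two_pass(ids, table, fallback):
    """Batch lookup first; run `fallback` only for the misses."""
    result = {}
    misses = []
    # First pass: one dict lookup per identifier.
    for id_ in ids:
        hit = table.get(id_)
        if hit:
            result[id_] = set(hit)
        else:
            misses.append(id_)
    # Second pass: slower per-ID pipeline, misses only.
    for id_ in misses:
        result[id_] = fallback(id_)
    return result


table = {'TP53': {'P04637'}, 'EGFR': {'P00533'}}
calls = []  # records which IDs reached the fallback


def upper_fallback(id_):
    """Toy fallback: retry with the uppercased identifier."""
    calls.append(id_)
    return set(table.get(id_.upper(), set()))


out = translate_two_pass(['TP53', 'egfr', 'XYZ'], table, upper_fallback)
```

Here only 'egfr' and 'XYZ' reach the fallback; 'TP53' is resolved in the first pass.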

Memory mode vs Database mode¶

The Python API uses memory mode: mapping tables are downloaded from upstream APIs (UniProt, BioMart, etc.), stored as dict[str, set[str]] in memory, and cached as pickle files on disk. This is fast for interactive use and small to medium batch jobs.

The REST API uses database mode: mapping tables are pre-loaded into PostgreSQL during a build step. Translation queries run as SQL queries against indexed tables. This handles concurrent requests and very large ID sets efficiently.

Both modes share the same fallback logic and cleanup pipeline. The difference is the data access layer:

Aspect	Memory mode	Database mode
Data loading	On-demand from upstream APIs	Pre-built into PostgreSQL
Lookup	Python dict (O(1))	SQL query (indexed)
Fallbacks	`map_name()` per-ID pipeline	Batch SQL queries per fallback step
Cleanup	Same pipeline	Same pipeline
Caching	Pickle files, 5-min memory expiry	Database is the cache
Concurrency	Single-process	Multi-process via connection pool

REST API fallback chain¶

The REST API implements the same fallback strategies as the Python API, but uses batch SQL queries instead of per-ID lookups.

How REST fallbacks work¶

1. Direct DB query: All input IDs are looked up in a single SQL query against the id_mapping table.
2. Gene symbol fallbacks (batch): For unmapped IDs where the source type is genesymbol:
- Uppercase batch: all unmapped IDs are uppercased and queried
- Capitalize batch: remaining unmapped IDs are capitalized and queried
- Synonym batch: remaining unmapped IDs are queried against genesymbol-syn
- Uppercase synonym batch: remaining unmapped IDs uppercased and queried against genesymbol-syn
3. Chain translation (batch): For unmapped IDs where neither side is uniprot: a batch intermediate query (source -> uniprot), then a batch final query (uniprot -> target).
4. UniProt cleanup: Same pipeline as the Python API (secondary -> primary, TrEMBL -> SwissProt, proteome filter, format validation).

When raw=true is passed, steps 2-4 are skipped entirely.
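The batch pattern can be sketched with an in-memory SQLite table. This is an assumption-laden illustration: the real service uses PostgreSQL, the two-column schema is a simplification, and batch_query and translate_batch are hypothetical names. Only the direct query and the two case fallbacks are shown.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE id_mapping (source_id TEXT, target_id TEXT)')
conn.executemany(
    'INSERT INTO id_mapping VALUES (?, ?)',
    [('TP53', 'P04637'), ('EGFR', 'P00533'), ('Trp53', 'P02340')],
)


def batch_query(ids):
    """Look up all `ids` in one SQL query; return {source_id: targets}."""
    ids = list(ids)
    if not ids:
        return {}
    placeholders = ','.join('?' * len(ids))
    rows = conn.execute(
        'SELECT source_id, target_id FROM id_mapping '
        f'WHERE source_id IN ({placeholders})',
        ids,
    )
    found = {}
    for source_id, target_id in rows:
        found.setdefault(source_id, set()).add(target_id)
    return found


def translate_batch(ids, raw=False):
    """Direct batch query, then one batch query per fallback step."""
    result = {id_: set() for id_ in ids}
    for id_, targets in batch_query(ids).items():
        result[id_] |= targets
    if raw:
        return result  # raw mode: direct lookup only
    for transform in (str.upper, str.capitalize):
        # Map each transformed variant back to the still-unmapped inputs.
        variants = {}
        for id_, targets in result.items():
            if not targets:
                variants.setdefault(transform(id_), []).append(id_)
        for variant, targets in batch_query(variants).items():
            for id_ in variants[variant]:
                result[id_] |= targets
    return result
```

Each fallback step issues a single query over whatever is still unmapped, so the number of queries depends on the number of steps, not on the number of identifiers.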

REST query parameters reference¶

Parameter	Type	Default	Description
`identifiers`	string	required	Comma-separated IDs (GET) or JSON list (POST)
`id_type`	string	required	Source ID type
`target_id_type`	string	required	Target ID type
`ncbi_tax_id`	int	9606	Organism NCBI Taxonomy ID
`raw`	bool	false	Skip all special-case handling
`backend`	string	null	Force specific backend (metadata only in DB mode)

GET examples:

# Basic translation
curl "https://omnipathdb.org/mapping/translate?\
identifiers=TP53,EGFR&\
id_type=genesymbol&\
target_id_type=uniprot"

# Raw mode -- direct lookup only
curl "https://omnipathdb.org/mapping/translate?\
identifiers=TP53&\
id_type=genesymbol&\
target_id_type=uniprot&\
raw=true"

# Specify organism (mouse)
curl "https://omnipathdb.org/mapping/translate?\
identifiers=Trp53&\
id_type=genesymbol&\
target_id_type=uniprot&\
ncbi_tax_id=10090"

POST example:

curl -X POST "https://omnipathdb.org/mapping/translate" \
     -H "Content-Type: application/json" \
     -d '{
       "identifiers": ["TP53", "EGFR", "BRCA1"],
       "id_type": "genesymbol",
       "target_id_type": "uniprot",
       "raw": false,
       "backend": null
     }'
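The same requests can be assembled from Python with the standard library. The sketch below only builds the URL and the JSON body and sends nothing; build_get_url and build_post_body are hypothetical helper names, while the endpoint and parameter names are taken from the reference table above.

```python
import json
from urllib.parse import urlencode

BASE_URL = 'https://omnipathdb.org/mapping/translate'


def build_get_url(identifiers, id_type, target_id_type,
                  ncbi_tax_id=9606, raw=False):
    """Assemble the GET URL; identifiers are comma-separated."""
    params = {
        'identifiers': ','.join(identifiers),
        'id_type': id_type,
        'target_id_type': target_id_type,
        'ncbi_tax_id': ncbi_tax_id,
        'raw': 'true' if raw else 'false',
    }
    # safe=',' keeps the commas literal instead of encoding them as %2C.
    return f'{BASE_URL}?{urlencode(params, safe=",")}'


def build_post_body(identifiers, id_type, target_id_type, raw=False):
    """Assemble the JSON body; identifiers are sent as a list."""
    return json.dumps({
        'identifiers': list(identifiers),
        'id_type': id_type,
        'target_id_type': target_id_type,
        'raw': raw,
        'backend': None,
    })


url = build_get_url(['Trp53'], 'genesymbol', 'uniprot', ncbi_tax_id=10090)
body = build_post_body(['TP53', 'EGFR', 'BRCA1'], 'genesymbol', 'uniprot')
```

The resulting url and body can be passed to any HTTP client.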

Backend details¶

Auto-selection algorithm¶

When no backend is specified, the mapper's _find_backends() method builds a candidate list:

Column-based backends (uniprot, uniprot_ftp, biomart): For each backend, it checks whether both the source and target ID types have a column defined in id_types.yaml under that backend's key. If both columns exist, the backend is added to the candidate list.
Custom backends (mirbase, unichem, ramp, hmdb, metanetx, bigg): These are always appended to the candidate list. They perform their own internal support check in their read() method and return an empty dict if the ID type pair is not supported.

The mapper tries each candidate in order. The first backend that returns non-empty data wins.
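The selection logic described above can be sketched as follows. This is not the code of _find_backends(): ID_TYPES is a hypothetical, already-parsed excerpt of id_types.yaml with illustrative column names, and the readers argument stands in for the backends' read() methods.

```python
# Hypothetical excerpt of id_types.yaml: ID type -> {backend: column}.
ID_TYPES = {
    'genesymbol': {'uniprot': 'gene_primary', 'biomart': 'external_gene_name'},
    'uniprot': {'uniprot': 'accession', 'biomart': 'uniprotswissprot'},
    'ensg': {'biomart': 'ensembl_gene_id'},
}
COLUMN_BACKENDS = ('uniprot', 'uniprot_ftp', 'biomart')
CUSTOM_BACKENDS = ('mirbase', 'unichem', 'ramp', 'hmdb', 'metanetx', 'bigg')


def find_backends(id_type, target_id_type):
    """Column-based backends that define both columns, then all custom ones."""
    src = ID_TYPES.get(id_type, {})
    tgt = ID_TYPES.get(target_id_type, {})
    candidates = [b for b in COLUMN_BACKENDS if b in src and b in tgt]
    return candidates + list(CUSTOM_BACKENDS)


def load_first_non_empty(id_type, target_id_type, readers):
    """Try candidates in order; the first non-empty table wins."""
    for name in find_backends(id_type, target_id_type):
        reader = readers.get(name)
        data = reader() if reader else {}
        if data:
            return name, data
    return None, {}
```

For example, find_backends('genesymbol', 'ensg') starts with 'biomart', because 'ensg' has no column under the 'uniprot' key in this toy configuration.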

Backend-specific behavior¶

uniprot¶

Data: UniProt AC, gene symbols, Entrez, HGNC, RefSeq, PDB, and all cross-references stored in UniProt entries.
Access: pypath.inputs.uniprot (preferred) or direct HTTP to the UniProt REST API.
Organism-specific: Yes. Queries are filtered by NCBI taxonomy ID.
Caching: Pickle file per (source, target, organism) triple.
When to force: Default for most protein-related translations.

uniprot_ftp¶

Data: Same scope as uniprot, but downloaded from UniProt FTP idmapping files (bulk, pre-computed).
Access: pypath.inputs.uniprot_ftp or direct HTTP to UniProt FTP.
Organism-specific: Yes, but only 12 model organisms have pre-computed files.
Caching: Same as uniprot.
When to force: Faster for full-proteome builds; useful when the REST API is slow or rate-limited.

uploadlists¶

Data: Same scope as UniProt REST.
Access: Direct HTTP to the UniProt ID Mapping batch service (submit job, poll, collect results).
Organism-specific: Yes.
When to force: When you have a specific, bounded set of IDs and want the most up-to-date cross-references.

biomart¶

Data: Ensembl gene/transcript/protein IDs, gene symbols, Entrez.
Access: pypath.inputs.biomart or direct HTTP to the Ensembl BioMart XML service.
Organism-specific: Yes.
Caching: Pickle file per triple.
When to force: When translating Ensembl IDs or when you need fresh Ensembl-specific data.

mirbase¶

Data: Precursor names (mir-pre), mature names (mir-mat-name), miRBase accessions (mirbase).
Access: pypath.inputs.mirbase (requires pypath).
Organism-specific: Yes.
When to force: miRNA translations.

unichem¶

Data: Cross-references between chemical databases (ChEMBL, ChEBI, DrugBank, PubChem, KEGG, etc.).
Access: pypath.inputs.unichem (requires pypath).
Organism-specific: No (chemicals are universal).
When to force: Chemical compound ID translation.

ramp¶

Data: Metabolite cross-references and synonym mappings from RaMP-DB.
Access: pypath.inputs.ramp (requires pypath).
Organism-specific: No.
When to force: Metabolite ID translation, especially when you need common-name-to-database-ID resolution.

hmdb¶

Data: HMDB, PubChem, ChEBI, DrugBank, KEGG compound.
Access: pypath.inputs.hmdb (requires pypath).
Organism-specific: Human-derived data, but identifiers are universal.
When to force: HMDB-centric metabolite lookups.

metanetx¶

Data: Pairwise metabolite ID cross-references from MNXref chem_xref.tsv (3.4M cross-reference entries).
Access: pypath.inputs.metanetx.metanetx_mapping() (requires pypath).
Organism-specific: No (chemicals are universal).
Supported pairs: bigg↔chebi, kegg↔chebi, hmdb↔chebi, lipidmaps↔chebi, swisslipids↔chebi, and all metanetx↔* combinations.
Coverage: 82K hmdb→chebi, 45K kegg→chebi, 23K lipidmaps→chebi, 11K bigg→chebi via MetaNetX bridge.
When to force: Metabolite ID translation, especially for lipidmaps→chebi and swisslipids→chebi which are not available in other backends.

bigg¶

Data: BiGG Models universal metabolite TSV (9,090 universal metabolites across 85+ models).
Access: pypath.inputs.bigg.bigg_metabolite_mapping() (requires pypath).
Organism-specific: No (chemicals are universal).
Supported pairs: bigg↔chebi, bigg↔hmdb, bigg↔kegg, bigg↔metanetx.
Coverage: 2,145 BiGG metabolites with ChEBI (10,319 pairs including ChEBI ontology hierarchy).
When to force: BiGG metabolite ID translation. Combined with the MetaNetX backend, gives maximum BiGG→ChEBI coverage.

Writing a new backend¶

To add a new data source:

Create a module in omnipath_utils/mapping/backends/ (e.g. _mybackend.py).

Subclass MappingBackend:

from omnipath_utils.mapping.backends._base import MappingBackend
from omnipath_utils.mapping.backends import register


class MyBackend(MappingBackend):
    name = "mybackend"
    yaml_key = "mybackend"  # key in id_types.yaml

    def _read_via_pypath(
        self, id_type, target_id_type, ncbi_tax_id,
        *, src_col, tgt_col, **kwargs,
    ):
        import pypath.inputs.mybackend as mymod
        # ... fetch and return dict[str, set[str]]
        raise ImportError  # if pypath not available

    def _read_direct(
        self, id_type, target_id_type, ncbi_tax_id,
        *, src_col, tgt_col, **kwargs,
    ):
        # Direct HTTP implementation
        # Return dict[str, set[str]]
        return {}

register("mybackend", MyBackend)

Add column definitions to id_types.yaml under the mybackend key for each supported ID type.
Column-based backends are discovered automatically by _find_backends(). A custom backend is picked up only after you add it to _CUSTOM_BACKENDS in Mapper.

HMDB identifier normalisation¶

HMDB identifiers have two historical formats: the old 5-digit format (HMDB00001) and the current 7-digit format (HMDB0000001). All translation APIs (Python and REST) accept both formats and normalise the old one to the 7-digit form at input time, so results always use the 7-digit format.

The normalisation is applied by the mapper before any backend lookup, so it works consistently across all backends that handle HMDB identifiers (hmdb, metanetx, bigg, ramp, unichem).
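The normalisation amounts to zero-padding the numeric part to seven digits. A minimal sketch, with normalise_hmdb as a hypothetical helper name rather than the library's own function:

```python
import re

_HMDB = re.compile(r'^HMDB(\d+)$')


def normalise_hmdb(identifier):
    """Pad the numeric part of an HMDB ID to 7 digits."""
    match = _HMDB.match(identifier)
    if not match:
        return identifier  # not an HMDB ID; leave untouched
    return 'HMDB' + match.group(1).zfill(7)
```

Identifiers already in the 7-digit form pass through unchanged, so applying the function twice is harmless.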

Database tables for special cases¶

During a database build, several auxiliary tables are created beyond the main id_mapping table:

Table content	Source -> Target	Purpose
Gene symbol -> SwissProt	`genesymbol -> swissprot`	TrEMBL-to-SwissProt cleanup step
TrEMBL -> gene symbol	`trembl -> genesymbol`	Reverse lookup in TrEMBL-to-SwissProt cleanup
Gene symbol synonyms -> UniProt	`genesymbol-syn -> uniprot`	Synonym fallback
SwissProt reference list	reflist	Proteome membership check (SwissProt)
TrEMBL reference list	reflist	Proteome membership check (TrEMBL)
HGNC IDs	`hgnc -> uniprot`	Additional protein ID coverage
RefSeq protein IDs	`refseqp -> uniprot`	Additional protein ID coverage
Secondary -> primary UniProt	`uniprot-sec -> uniprot-pri`	Obsolete AC resolution

These tables enable the cleanup pipeline and fallback strategies to run entirely within the database, without loading data into memory.

Caching and performance¶

Memory mode caching¶

Mapping tables are cached as pickle files in ~/.cache/omnipath_utils/mapping/. Each unique combination of (id_type, target_id_type, ncbi_tax_id, backend) produces a deterministic filename via MD5 hash:

mapping_genesymbol__uniprot__9606__a1b2c3d4e5f6.pickle

In-memory tables auto-expire after 5 minutes (300 seconds) of inactivity. The lifetime parameter on the Mapper constructor controls this. Expired tables are removed on the next remove_expired() call.
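A deterministic filename and the expiry check can be sketched as below. The exact string that the library hashes is not documented here, so the key format and the 12-character truncation (taken from the example filename) are assumptions, and cache_filename and is_expired are hypothetical names.

```python
import hashlib
import time


def cache_filename(id_type, target_id_type, ncbi_tax_id, backend):
    """Same inputs always yield the same filename."""
    # Assumption: the hash covers all four components of the key.
    key = f'{id_type}|{target_id_type}|{ncbi_tax_id}|{backend}'
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()[:12]
    return (
        f'mapping_{id_type}__{target_id_type}'
        f'__{ncbi_tax_id}__{digest}.pickle'
    )


def is_expired(last_used, lifetime=300, now=None):
    """True when a table has been idle for longer than `lifetime` seconds."""
    now = time.time() if now is None else now
    return now - last_used > lifetime
```

MD5 is used here only to derive a stable name, not for any security purpose.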

To clear the cache:

rm -rf ~/.cache/omnipath_utils/mapping/

To change the cache directory:

from omnipath_utils.mapping._mapper import Mapper
mapper = Mapper(cachedir='/custom/cache/path')

Database mode performance¶

The database schema uses indexes on (source_type_id, target_type_id, ncbi_tax_id, source_id) for fast lookups. The COPY command is used during builds for fast bulk inserts. Partitioning by organism is available for very large deployments.

Translation performance tips¶

Use batch functions for multiple IDs. translate() and translate_column() use translate_core(), which performs a vectorized first pass. Calling map_name() in a loop forces every ID through the full fallback pipeline.

from omnipath_utils.mapping import map_name, translate

# Good: vectorized first pass, fallback only for misses
result = translate(gene_list, 'genesymbol', 'uniprot')

# Bad: every ID goes through full pipeline
result = {g: map_name(g, 'genesymbol', 'uniprot') for g in gene_list}

Use raw=True when you do not need fallbacks. If your identifiers are already in canonical form (e.g. uppercase gene symbols, primary UniProt ACs), raw mode avoids all overhead:

from omnipath_utils.mapping import translate

result = translate(clean_gene_list, 'genesymbol', 'uniprot', raw=True)

Pre-fetch the translation table for repeated lookups. If you are translating IDs across multiple DataFrames or in a loop, fetch the table once and reuse it:

from omnipath_utils.mapping import translation_table

# Fetch the table once, then reuse it as a plain dict
table = translation_table('genesymbol', 'uniprot')
uniprots = {gene: table.get(gene, set()) for gene in genes}

For DataFrames, use translate_column(). It handles deduplication, batch lookup, and optional row expansion in a single call:

from omnipath_utils.mapping import translate_column

df = translate_column(df, 'gene', 'genesymbol', 'uniprot')

Identify and All-Mappings endpoints¶

GET /mapping/identify¶

Given one or more identifiers, search all mapping tables to find which ID types contain them. Useful when the type of an identifier is unknown.

Parameters:

Parameter	Type	Required	Description
`identifiers`	string	yes	Comma-separated identifiers
`ncbi_tax_id`	int	no	NCBI Taxonomy ID (default: 9606)

curl "https://omnipathdb.org/mapping/identify?\
identifiers=P04637,HMDB0000001"

Response:

{
    "results": {
        "P04637": [
            {"id_type": "uniprot", "role": "source", "mappings_count": 5},
            {"id_type": "uniprot", "role": "target", "mappings_count": 1}
        ],
        "HMDB0000001": [
            {"id_type": "hmdb", "role": "source", "mappings_count": 3}
        ]
    },
    "meta": {
        "ncbi_tax_id": 9606,
        "total_input": 2
    }
}

Each match includes:

id_type -- the canonical ID type name where the identifier was found.
role -- "source" or "target", indicating whether the identifier appears as a source or target in mapping rows.
mappings_count -- the number of distinct mapped partners.

GET /mapping/all¶

Given identifiers and their type, return all known mappings to every other target type in a single request.

Parameters:

Parameter	Type	Required	Description
`identifiers`	string	yes	Comma-separated identifiers
`id_type`	string	yes	Source ID type
`ncbi_tax_id`	int	no	NCBI Taxonomy ID (default: 9606)

curl "https://omnipathdb.org/mapping/all?\
identifiers=P04637&\
id_type=uniprot"

Response:

{
    "results": {
        "P04637": {
            "genesymbol": ["TP53"],
            "entrez": ["7157"],
            "ensg": ["ENSG00000141510"],
            "hgnc": ["11998"]
        }
    },
    "meta": {
        "id_type": "uniprot",
        "ncbi_tax_id": 9606,
        "total_input": 1
    }
}
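Conceptually, the endpoint collects the hits from every table whose source type matches id_type. A minimal in-memory sketch with toy data follows; all_mappings_sketch and the tables layout are hypothetical and do not reflect the server's SQL implementation.

```python
def all_mappings_sketch(identifiers, id_type, tables):
    """Group each identifier's mappings by target ID type."""
    results = {}
    for id_ in identifiers:
        per_type = {}
        for (src_type, tgt_type), table in tables.items():
            if src_type == id_type and id_ in table:
                # Sorted lists give a stable, JSON-friendly output.
                per_type[tgt_type] = sorted(table[id_])
        results[id_] = per_type
    return results


# Toy tables keyed by (source type, target type).
tables = {
    ('uniprot', 'genesymbol'): {'P04637': {'TP53'}},
    ('uniprot', 'entrez'): {'P04637': {'7157'}},
    ('genesymbol', 'uniprot'): {'TP53': {'P04637'}},
}
out = all_mappings_sketch(['P04637'], 'uniprot', tables)
```

Tables with a different source type, such as genesymbol -> uniprot here, are ignored.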

Python API¶

Both functions are available from omnipath_utils.mapping:

from omnipath_utils.mapping import identify, all_mappings

# Identify unknown identifiers
identify(["P04637", "HMDB0000001"])

# Get all mappings
all_mappings(["P04637"], "uniprot")

And from the client:

from omnipath_client.utils import identify, all_mappings

identify(["P04637", "HMDB0000001"])
all_mappings(["P04637"], "uniprot")

Both require database mode (PostgreSQL) on the server side.