Working with data#

CORNETO provides a flexible and convenient way to work with data for network problems. You can use the Data class to create a new dataset, or load an existing dataset from a file. The Data class provides methods for querying a given dataset. The main goal of this class is to offer a simple interface for working with datasets that are typically sparse and fit in memory, while still allowing complex queries and transformations.

import corneto as cn

cn.info()
Installed version: v1.0.0.dev5 (latest stable: v1.0.0-alpha)
Available backends: CVXPY v1.6.5, PICOS v2.6.1
Default backend (corneto.opt): CVXPY
Installed solvers: CVXOPT, GLPK, GLPK_MI, HIGHS, SCIP, SCIPY
Graphviz version: v0.20.3
Installed path: /home/runner/work/corneto/corneto/corneto
Repository: https://github.com/saezlab/corneto

Creating a new dataset from a dictionary#

Most of the network inference approaches in CORNETO require some type of measurement to be mapped to prior knowledge networks. For convenience, datasets in CORNETO are represented as dictionaries with features (any measurement that is mapped, or will be mapped, to a prior knowledge network) and samples, which are collections of features. The simplest way to create a dataset is to define a dictionary where keys are sample names (e.g., conditions such as different perturbed cells). Each sample has features, which are lists of dictionaries containing information about the feature. Every feature has at least an id (e.g., gene name, protein name, metabolite name), a value for that sample, and a mapping attribute that indicates whether the feature maps to a vertex, an edge, or none (for features that are not mapped to the network, or that need pre-processing to determine the mapping values). Each method may require additional attributes, which are described in the documentation of that method. The following example shows how to create a simple dataset:

# A simple dataset with two samples, and three features per sample.

samples = {
    "sample1": {
        "features": [
            {"id": "receptor1", "value": 1, "mapping": "vertex", "role": "input"},
            {"id": "tf1", "value": 1, "mapping": "vertex", "role": "output"},
            {"id": "tf2", "value": 1, "mapping": "vertex", "role": "output"},
        ]
    },
    "sample2": {
        "features": [
            {"id": "receptor2", "value": 1, "mapping": "vertex", "role": "input"},
            {"id": "tf1", "value": 1, "mapping": "vertex", "role": "output"},
            {"id": "tf2", "value": -1, "mapping": "vertex", "role": "output"},
        ]
    },
}

data = cn.Data.from_dict(samples)
data
Data(n_samples=2, n_feats=[3 3])

For convenience, data can also be imported from a more compact dictionary definition where features are indexed by id.

samples = {
    "sample1": {
        "receptor1": {
            "value": 1,
            "mapping": "vertex",
            "role": "input",
        },
        "tf1": {
            "value": 1,
            "mapping": "vertex",
            "role": "output",
        },
        "tf2": {
            "value": 1,
            "mapping": "vertex",
            "role": "output",

        },
    },
    "sample2": {
        "receptor2": {
            "value": 1,
            "mapping": "vertex",
            "role": "input",
        },
        "tf1": {
            "value": 1,
            "mapping": "vertex",
            "role": "output",
        },
        "tf3": {
            "value": -1,
            "mapping": "vertex",
            "role": "output",
        },
        "tf4": {
            "value": 1,
            "mapping": "vertex",
            "role": "output",
        },
    },
}

data = cn.Data.from_cdict(samples)
data
Data(n_samples=2, n_feats=[3 4])
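Conceptually, `from_cdict` accepts the same information as `from_dict`, but with features keyed by their id instead of stored in a list. The expansion from one form to the other can be sketched in plain Python (the helper name below is hypothetical and not part of CORNETO):

```python
# Illustrative sketch (not CORNETO internals): expand the compact,
# id-indexed form into the list-of-features form accepted by from_dict.
def expand_compact(samples):
    return {
        sample: {
            "features": [
                {"id": feat_id, **attrs} for feat_id, attrs in feats.items()
            ]
        }
        for sample, feats in samples.items()
    }

compact = {"sample1": {"tf1": {"value": 1, "mapping": "vertex", "role": "output"}}}
expanded = expand_compact(compact)
# expanded == {"sample1": {"features": [
#     {"id": "tf1", "value": 1, "mapping": "vertex", "role": "output"}]}}
```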
# Samples are just lists of features
data.samples['sample1'].features
[Feature(id=receptor1, value=1, mapping=vertex, role=input),
 Feature(id=tf1, value=1, mapping=vertex, role=output),
 Feature(id=tf2, value=1, mapping=vertex, role=output)]
# We can add a new feature to a sample. These are stored as lists.
data.samples['sample1'].features.append(
    cn.Feature(id="tf3", value=1, mapping="vertex", role="output")
)
data.to_dict()
{'sample1': {'features': [{'id': 'receptor1',
    'value': 1,
    'mapping': 'vertex',
    'role': 'input'},
   {'id': 'tf1', 'value': 1, 'mapping': 'vertex', 'role': 'output'},
   {'id': 'tf2', 'value': 1, 'mapping': 'vertex', 'role': 'output'},
   {'id': 'tf3', 'value': 1, 'mapping': 'vertex', 'role': 'output'}]},
 'sample2': {'features': [{'id': 'receptor2',
    'value': 1,
    'mapping': 'vertex',
    'role': 'input'},
   {'id': 'tf1', 'value': 1, 'mapping': 'vertex', 'role': 'output'},
   {'id': 'tf3', 'value': -1, 'mapping': 'vertex', 'role': 'output'},
   {'id': 'tf4', 'value': 1, 'mapping': 'vertex', 'role': 'output'}]}}

Querying features#

To inspect or obtain features from a dataset, you can use the query method. This method allows you to filter features based on their attributes. For example, you can find all features that are mapped to a specific vertex in the network, or all features that have a specific value.

data.samples['sample1'].query.filter(
    lambda f: f.id.startswith("tf")
).collect()
Sample(n_feats=3)
# We can get all the unique ids from the samples
data.samples['sample1'].query.filter(
    lambda f: f.id.startswith("tf")
).pluck()
{'tf1', 'tf2', 'tf3'}
# Get features with unique ids
data.samples['sample1'].query.unique().to_list()
[Feature(id=receptor1, value=1, mapping=vertex, role=input),
 Feature(id=tf1, value=1, mapping=vertex, role=output),
 Feature(id=tf2, value=1, mapping=vertex, role=output),
 Feature(id=tf3, value=1, mapping=vertex, role=output)]

The unique() method can use different keys to determine uniqueness. By default, it uses the feature’s id field, but you can specify other fields like mapping, role, or any combination of keys.

# Get features with unique roles
data.samples['sample1'].query.unique(["role"]).to_list()
[Feature(id=receptor1, value=1, mapping=vertex, role=input),
 Feature(id=tf1, value=1, mapping=vertex, role=output)]
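The deduplication behavior can be sketched in plain Python: for a given list of keys, keep the first feature seen for each distinct key tuple (a sketch of the behavior, not CORNETO's implementation):

```python
# Plain-Python sketch of unique(keys): keep the first feature
# encountered for each distinct tuple of key values.
def unique_by(features, keys):
    seen = set()
    result = []
    for feat in features:
        key = tuple(feat.get(k) for k in keys)
        if key not in seen:
            seen.add(key)
            result.append(feat)
    return result

features = [
    {"id": "receptor1", "mapping": "vertex", "role": "input"},
    {"id": "tf1", "mapping": "vertex", "role": "output"},
    {"id": "tf2", "mapping": "vertex", "role": "output"},
]
unique_by(features, ["role"])  # keeps receptor1 (input) and tf1 (first output)
```

Passing a combination such as `["mapping", "role"]` would instead keep one feature per distinct (mapping, role) pair.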

Working across multiple samples#

The Data class also provides a query interface that works across all samples. This allows you to perform operations on all features in all samples at once.

# Get all unique feature IDs across all samples
data.query.pluck()
{'receptor1', 'receptor2', 'tf1', 'tf2', 'tf3', 'tf4'}
# Get features with unique IDs across all samples
data.query.filter_features(
    lambda f: f.id.startswith("tf")
).unique().pluck()
{'tf1', 'tf2', 'tf3', 'tf4'}
# Filter features across all samples
dataq = data.query.filter_features(
    lambda f: f.id.startswith("tf")
).collect()

dataq
Data(n_samples=2, n_feats=[3 3])
dataq.to_dict()
{'sample1': {'features': [{'id': 'tf1',
    'value': 1,
    'mapping': 'vertex',
    'role': 'output'},
   {'id': 'tf2', 'value': 1, 'mapping': 'vertex', 'role': 'output'},
   {'id': 'tf3', 'value': 1, 'mapping': 'vertex', 'role': 'output'}]},
 'sample2': {'features': [{'id': 'tf1',
    'value': 1,
    'mapping': 'vertex',
    'role': 'output'},
   {'id': 'tf3', 'value': -1, 'mapping': 'vertex', 'role': 'output'},
   {'id': 'tf4', 'value': 1, 'mapping': 'vertex', 'role': 'output'}]}}

Saving and loading datasets#
Datasets can be serialized to disk with save and restored with load. Below, the filtered dataset is round-tripped through an xz-compressed temporary file:

import tempfile

file = tempfile.NamedTemporaryFile(delete=False).name
dataq.save(file, compression="xz")
dataq.load(file, compression="xz").to_dict()
{'sample1': {'features': [{'id': 'tf1',
    'value': 1,
    'mapping': 'vertex',
    'role': 'output'},
   {'id': 'tf2', 'value': 1, 'mapping': 'vertex', 'role': 'output'},
   {'id': 'tf3', 'value': 1, 'mapping': 'vertex', 'role': 'output'}]},
 'sample2': {'features': [{'id': 'tf1',
    'value': 1,
    'mapping': 'vertex',
    'role': 'output'},
   {'id': 'tf3', 'value': -1, 'mapping': 'vertex', 'role': 'output'},
   {'id': 'tf4', 'value': 1, 'mapping': 'vertex', 'role': 'output'}]}}
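The round-trip idea behind save and load with `compression="xz"` can be sketched with the standard library alone. Note this is illustrative only: CORNETO's actual on-disk format may differ from the JSON used here.

```python
import json
import lzma
import tempfile

# Illustrative sketch: xz-compressed round-trip of a dataset-like dict.
# This mirrors the save/load idea; it is NOT CORNETO's on-disk format.
data_dict = {"sample1": {"features": [{"id": "tf1", "value": 1}]}}

path = tempfile.NamedTemporaryFile(delete=False, suffix=".xz").name
with lzma.open(path, "wt") as fh:   # "wt": write text through xz compression
    json.dump(data_dict, fh)

with lzma.open(path, "rt") as fh:   # "rt": read text back through xz
    restored = json.load(fh)

assert restored == data_dict
```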