corneto.data.Data

class corneto.data.Data(dict=None, /, **kwargs)

Bases: UserDict[Any, Sample]

A dataset container that maps sample IDs to Sample objects.

This class provides methods to add, update, delete, filter, and merge samples and their associated features. It also converts to and from other representations, such as nested dictionaries and the tight format.

data

A mapping from sample identifiers to Sample objects.

Type: Dict[Any, Sample]

Examples

Create a new dataset and add a sample with features:

>>> dataset = Data()
>>> dataset.add_sample("cell1", {"TP53": -1, "BRCA1": 0.5})
>>> print(dataset)
Dataset(num_samples=1)

Create a dataset from a dictionary:

>>> raw = {
...     "cell1": {"TP53": -1, "BRCA1": 0.5},
...     "cell2": {"TP53": 0, "BRCA1": -0.2},
... }
>>> dataset = Data.from_dict(raw)
>>> print(dataset)
Dataset(num_samples=2)

__init__(dict=None, /, **kwargs)

Methods

`__init__`([dict])
`add_sample`(sample_id[, features])	Add a new sample with the given features to the dataset.
`clear`()
`collect_features`(metadata_key, metadata_value, *)	Collect feature names or values across samples that have a specific metadata key-value pair.
`copy`()	Create a shallow copy of the dataset.
`delete_sample`(sample_id)	Delete a sample from the dataset by its ID.
`filter`(predicate)	Filter features across all samples based on a predicate function.
`filter_by`(key, value)	Filter features across all samples that have metadata matching the given key and value.
`filter_samples`(predicate)	Filter samples based on a predicate function.
`from_dict`(raw_data)	Create a Data instance from a nested dictionary.
`from_sample_value_dict`(condition_dict[, ...])	Create a Data instance from a nested dictionary of sample IDs to feature dictionaries.
`from_tight_format`(tight_data)	Create a Data instance from a tight format dictionary.
`fromkeys`(iterable[, value])
`get`(k[,d])
`get_feature_across_samples`(feature_name)	Retrieve a specific feature from all samples that contain it.
`items`()
`keys`()
`merge`(other)	Merge another Data instance into this one.
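`pop`(k[,d])	Remove the specified key and return the corresponding value. If key is not found, d is returned if given, otherwise KeyError is raised.
`popitem`()	Remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.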
`setdefault`(k[,d])
`subset_features`(feature_list)	Create a new Data instance containing only the specified features.
`to_sample_value_dict`([value_key])	Convert the dataset to a nested dictionary of sample IDs mapping to feature dictionaries, where each feature is represented as a dictionary that includes a value and optional metadata.
`to_tight_format`()	Convert the dataset to a "tight" format, which is a dictionary with lists for samples, features, values, and metadata.
`update`([E, ]**F)	If E is present and has a .keys() method, does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, does: for (k, v) in E: D[k] = v. In either case, this is followed by: for k, v in F.items(): D[k] = v.
`update_sample`(sample_id, features)	Update the features of an existing sample or add a new sample if it does not exist.
`values`()

add_sample(sample_id, features=None)

Add a new sample with the given features to the dataset.

Parameters:

sample_id (str) – Unique identifier for the sample.
features (Optional[Dict[str, Any]]) – A dictionary of feature names and their values. Defaults to None.

Raises:

ValueError – If a sample with the given ID already exists.

Return type:

None

Examples

>>> dataset = Data()
>>> dataset.add_sample("cell1", {"TP53": -1})
>>> "cell1" in dataset.data
True
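
The duplicate check listed under Raises can be illustrated with a small stand-in class. This is a sketch only: `MiniData` is a hypothetical name, it stores features as plain dictionaries instead of Sample objects, and it is not corneto's implementation.

```python
from collections import UserDict
from typing import Any, Dict, Optional


class MiniData(UserDict):
    """Hypothetical stand-in for Data; features are stored as plain dicts."""

    def add_sample(self, sample_id: str, features: Optional[Dict[str, Any]] = None) -> None:
        # Refuse to overwrite an existing sample, as the Raises entry describes.
        if sample_id in self.data:
            raise ValueError(f"Sample {sample_id!r} already exists")
        self.data[sample_id] = dict(features or {})


dataset = MiniData()
dataset.add_sample("cell1", {"TP53": -1})

# Adding the same ID a second time is rejected instead of silently overwriting.
try:
    dataset.add_sample("cell1", {"BRCA1": 0.5})
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
print(dataset["cell1"])    # {'TP53': -1}
```

To change a sample that already exists, update_sample is the documented alternative.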

classmethod from_dict(raw_data)

Create a Data instance from a nested dictionary.

Each key in raw_data is interpreted as a sample ID, and its value is a dictionary mapping feature names to feature data.

Parameters: raw_data (Dict[str, Dict[str, Any]]) – Raw data dictionary.
Returns: A new instance of Data containing the samples and features.
Return type: Data

Examples

>>> raw = {
...     "cell1": {"TP53": -1, "BRCA1": 0.5},
...     "cell2": {"TP53": 0}
... }
>>> dataset = Data.from_dict(raw)
>>> print(dataset)
Dataset(num_samples=2)

filter(predicate)

Filter features across all samples based on a predicate function.

Parameters:

predicate (Callable[[str, str, Any], bool]) – A function that takes a sample ID, feature name, and feature value, returning True if the feature should be included.

Returns:

A new Data instance containing only the features that satisfy the predicate, with empty samples excluded.

Return type:

Data

Examples

>>> dataset = Data.from_dict({
...     "cell1": {"TP53": -1, "BRCA1": 0.5},
...     "cell2": {"TP53": 0, "EGFR": 1.2},
... })
>>> filtered = dataset.filter(lambda sid, fname, fvalue: isinstance(fvalue, (int, float)) and fvalue > 0)
>>> filtered.data["cell1"].features
{'BRCA1': 0.5}
>>> filtered.data["cell2"].features
{'EGFR': 1.2}
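
The phrase "with empty samples excluded" is easiest to see on a sample that loses all of its features. The following plain-dictionary sketch mimics the documented semantics; `filter_features` is a hypothetical helper, not part of corneto, and samples are modelled as ordinary dicts.

```python
from typing import Any, Callable, Dict

Nested = Dict[str, Dict[str, Any]]


def filter_features(data: Nested, predicate: Callable[[str, str, Any], bool]) -> Nested:
    """Keep features for which predicate(sample_id, feature_name, value) is true."""
    result: Nested = {}
    for sample_id, features in data.items():
        kept = {
            name: value
            for name, value in features.items()
            if predicate(sample_id, name, value)
        }
        if kept:  # a sample left with no features is dropped entirely
            result[sample_id] = kept
    return result


raw = {
    "cell1": {"TP53": -1, "BRCA1": 0.5},
    "cell2": {"TP53": 0, "EGFR": 1.2},
    "cell3": {"TP53": -2},
}
positive = filter_features(raw, lambda sid, name, value: value > 0)
print(positive)  # {'cell1': {'BRCA1': 0.5}, 'cell2': {'EGFR': 1.2}}
```

Here "cell3" has no positive feature, so it does not appear in the result at all.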

filter_by(key, value)

Filter features across all samples that have metadata matching the given key and value.

Parameters:

key (str) – The metadata key to filter by.
value (Any) – The value that the metadata key should match.

Returns:

A new Data instance containing only the features with matching metadata, with empty samples excluded.

Return type:

Data

Examples

>>> dataset = Data.from_dict({
...     "cell1": {
...         "TP53": {"value": -1, "type": "tumor_suppressor"},
...         "EGFR": {"value": 0.8, "type": "oncogene"}
...     },
...     "cell2": {
...         "KRAS": {"value": 1.2, "type": "oncogene"},
...         "BRCA1": {"value": -0.5, "type": "tumor_suppressor"}
...     }
... })
>>> filtered = dataset.filter_by("type", "oncogene")
>>> filtered.data["cell1"].features
{'EGFR': {'value': 0.8, 'type': 'oncogene'}}
>>> filtered.data["cell2"].features
{'KRAS': {'value': 1.2, 'type': 'oncogene'}}

subset_features(feature_list)

Create a new Data instance containing only the specified features.

Parameters: feature_list (List[str]) – List of feature names to retain in the subset.
Returns: A new dataset instance with samples containing only the allowed features.
Return type: Data

Examples

>>> dataset = Data.from_dict({
...     "cell1": {"TP53": -1, "BRCA1": 0.5, "EGFR": 1.2},
...     "cell2": {"TP53": 0, "BRCA1": -0.2, "KRAS": 0.8}
... })
>>> subset = dataset.subset_features(["TP53", "KRAS"])
>>> subset.data["cell1"].features
{'TP53': -1}
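
As a rough illustration of the selection logic, the same subset can be computed on plain dictionaries. `subset` is a hypothetical helper rather than corneto code, and the text does not say how the library treats a sample left with no features, so this sketch simply keeps every sample.

```python
from typing import Any, Dict, List

Nested = Dict[str, Dict[str, Any]]


def subset(data: Nested, feature_list: List[str]) -> Nested:
    """Return a new nested dict keeping only the features named in feature_list."""
    allowed = set(feature_list)  # set membership keeps the lookup cheap
    return {
        sample_id: {name: value for name, value in features.items() if name in allowed}
        for sample_id, features in data.items()
    }


raw = {
    "cell1": {"TP53": -1, "BRCA1": 0.5, "EGFR": 1.2},
    "cell2": {"TP53": 0, "BRCA1": -0.2, "KRAS": 0.8},
}
selected = subset(raw, ["TP53", "KRAS"])
print(selected["cell1"])  # {'TP53': -1}
print(selected["cell2"])  # {'TP53': 0, 'KRAS': 0.8}
```

A feature name that a sample does not carry, such as "KRAS" for "cell1", is skipped without error in this sketch.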

to_tight_format()

Convert the dataset to a “tight” format, which is a dictionary with lists for samples, features, values, and metadata.

The tight format organizes data in parallel lists where each index corresponds to a particular feature of a sample.

Returns: A dictionary with keys “sample”, “feature”, “value”, and “metadata”.
Return type: Dict[str, List[Any]]

Examples

>>> dataset = Data.from_dict({
...     "cell1": {"TP53": {"value": -1, "confidence": 0.9}, "BRCA1": 0.5}
... })
>>> tight = dataset.to_tight_format()
>>> tight["sample"]
['cell1', 'cell1']
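
To make the parallel-list layout concrete, here is a plain-dictionary sketch that builds all four lists. `to_tight` is a hypothetical helper, not the library's code; it assumes the value lives under the "value" key and that a feature without extra keys gets None as its metadata entry.

```python
from typing import Any, Dict, List

Nested = Dict[str, Dict[str, Any]]


def to_tight(data: Nested) -> Dict[str, List[Any]]:
    """Flatten {sample: {feature: entry}} into four parallel lists."""
    tight: Dict[str, List[Any]] = {"sample": [], "feature": [], "value": [], "metadata": []}
    for sample_id, features in data.items():
        for name, entry in features.items():
            if isinstance(entry, dict):
                value = entry.get("value")
                # Everything except the value counts as metadata (assumption).
                metadata = {k: v for k, v in entry.items() if k != "value"} or None
            else:
                value, metadata = entry, None
            tight["sample"].append(sample_id)
            tight["feature"].append(name)
            tight["value"].append(value)
            tight["metadata"].append(metadata)
    return tight


raw = {"cell1": {"TP53": {"value": -1, "confidence": 0.9}, "BRCA1": 0.5}}
tight = to_tight(raw)
print(tight["sample"])    # ['cell1', 'cell1']
print(tight["feature"])   # ['TP53', 'BRCA1']
print(tight["value"])     # [-1, 0.5]
print(tight["metadata"])  # [{'confidence': 0.9}, None]
```

Index i of every list describes the same (sample, feature) pair, which is why all four lists always have equal length.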

classmethod from_tight_format(tight_data)

Create a Data instance from a tight format dictionary.

Parameters: tight_data (Dict[str, List[Any]]) – A dictionary with keys “sample”, “feature”, “value”, and “metadata” that contains parallel lists.
Returns: A new Data instance constructed from the tight format data.
Return type: Data

Examples

>>> tight = {
...     "sample": ["cell1", "cell1"],
...     "feature": ["TP53", "BRCA1"],
...     "value": [-1, 0.5],
...     "metadata": [{"confidence": 0.9}, None]
... }
>>> dataset = Data.from_tight_format(tight)
>>> dataset.data["cell1"].features["TP53"]
{'value': -1, 'confidence': 0.9}
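
Reading the parallel lists back amounts to walking them in lockstep. The sketch below does this with plain dictionaries; `from_tight` is a hypothetical helper, and both the length check and the choice to store every feature as a dictionary with a "value" key are assumptions of this sketch, not documented library behaviour.

```python
from typing import Any, Dict, List

Nested = Dict[str, Dict[str, Dict[str, Any]]]


def from_tight(tight: Dict[str, List[Any]]) -> Nested:
    """Rebuild {sample: {feature: {"value": ..., **metadata}}} from parallel lists."""
    keys = ("sample", "feature", "value", "metadata")
    lengths = {len(tight[key]) for key in keys}
    if len(lengths) != 1:
        raise ValueError("All tight-format lists must have the same length")

    data: Nested = {}
    for sample_id, name, value, metadata in zip(*(tight[key] for key in keys)):
        entry = {"value": value}
        entry.update(metadata or {})  # None means "no metadata"
        data.setdefault(sample_id, {})[name] = entry
    return data


tight = {
    "sample": ["cell1", "cell1"],
    "feature": ["TP53", "BRCA1"],
    "value": [-1, 0.5],
    "metadata": [{"confidence": 0.9}, None],
}
rebuilt = from_tight(tight)
print(rebuilt["cell1"]["TP53"])   # {'value': -1, 'confidence': 0.9}
print(rebuilt["cell1"]["BRCA1"])  # {'value': 0.5}
```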

to_sample_value_dict(value_key='value')

Convert the dataset to a nested dictionary of sample IDs mapping to feature dictionaries, where each feature is represented as a dictionary that includes a value and optional metadata.

Parameters:

value_key (str, optional) – The key to use for feature values. Defaults to “value”.

Returns:

A nested dictionary mapping sample IDs to features and their associated dictionaries.

Return type:

Dict[str, Dict[str, dict]]

Examples

>>> dataset = Data.from_dict({
...     "cell1": {"TP53": {"value": -1, "confidence": 0.9}, "BRCA1": 0.5}
... })
>>> sv_dict = dataset.to_sample_value_dict()
>>> sv_dict["cell1"]["TP53"]
{'value': -1, 'confidence': 0.9}
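
The description implies that a bare feature value ends up wrapped in a dictionary. A plain-dictionary sketch of that normalization follows; `to_sample_value` is a hypothetical helper, and copying dictionary entries unchanged is an assumption of the sketch, not documented behaviour.

```python
from typing import Any, Dict

Nested = Dict[str, Dict[str, Any]]


def to_sample_value(data: Nested, value_key: str = "value") -> Dict[str, Dict[str, dict]]:
    """Represent every feature as a dict that carries its value under value_key."""
    result: Dict[str, Dict[str, dict]] = {}
    for sample_id, features in data.items():
        result[sample_id] = {
            # Dict entries are copied as-is; bare values are wrapped.
            name: dict(entry) if isinstance(entry, dict) else {value_key: entry}
            for name, entry in features.items()
        }
    return result


raw = {"cell1": {"TP53": {"value": -1, "confidence": 0.9}, "BRCA1": 0.5}}
sv = to_sample_value(raw)
print(sv["cell1"]["TP53"])   # {'value': -1, 'confidence': 0.9}
print(sv["cell1"]["BRCA1"])  # {'value': 0.5}
```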

classmethod from_sample_value_dict(condition_dict, value_key='value')

Create a Data instance from a nested dictionary of sample IDs to feature dictionaries.

Each feature dictionary must contain the specified value_key.

Parameters:

condition_dict (Dict[str, Dict[str, dict]]) – A nested dictionary where each key is a sample ID, and each value is a dictionary mapping feature names to feature dictionaries.
value_key (str, optional) – The key that must be present in each feature dictionary. Defaults to “value”.

Returns:

A new Data instance populated with the provided features.

Return type:

Data

Raises:

ValueError – If a feature in a sample is missing the value_key.

Examples

>>> condition = {
...     "cell1": {
...         "TP53": {"value": -1, "confidence": 0.9},
...         "BRCA1": {"value": 0.5}
...     }
... }
>>> dataset = Data.from_sample_value_dict(condition)
>>> dataset.data["cell1"].features["TP53"]
{'value': -1, 'confidence': 0.9}
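
The value_key requirement can be checked up front before any data is loaded. The sketch below shows one way to do this on a plain nested dictionary; `check_sample_value_dict` is a hypothetical helper and the error message wording is invented for illustration.

```python
from typing import Dict


def check_sample_value_dict(condition_dict: Dict[str, Dict[str, dict]], value_key: str = "value") -> None:
    """Raise ValueError if any feature dictionary lacks value_key."""
    for sample_id, features in condition_dict.items():
        for name, entry in features.items():
            if value_key not in entry:
                raise ValueError(
                    f"Feature {name!r} in sample {sample_id!r} is missing {value_key!r}"
                )


good = {"cell1": {"TP53": {"value": -1, "confidence": 0.9}, "BRCA1": {"value": 0.5}}}
bad = {"cell1": {"TP53": {"confidence": 0.9}}}  # no "value" key

check_sample_value_dict(good)  # passes silently

try:
    check_sample_value_dict(bad)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)  # Feature 'TP53' in sample 'cell1' is missing 'value'
```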

delete_sample(sample_id)

Delete a sample from the dataset by its ID.

Parameters: sample_id (str) – The ID of the sample to delete.
Raises: KeyError – If the sample with the given ID does not exist.
Return type: None

Examples

>>> dataset = Data.from_dict({"cell1": {"TP53": -1}})
>>> dataset.delete_sample("cell1")
>>> "cell1" in dataset.data
False

update_sample(sample_id, features)

Update the features of an existing sample or add a new sample if it does not exist.

Parameters:

sample_id (str) – The ID of the sample to update.
features (Dict[str, Any]) – A dictionary of features to update or add.

Return type:

None

Examples

>>> dataset = Data.from_dict({"cell1": {"TP53": -1}})
>>> dataset.update_sample("cell1", {"BRCA1": 0.5})
>>> dataset.data["cell1"].features
{'TP53': -1, 'BRCA1': 0.5}

get_feature_across_samples(feature_name)

Retrieve a specific feature from all samples that contain it.

Parameters: feature_name (str) – The name of the feature to retrieve.
Returns: A dictionary mapping sample IDs to the value of the specified feature.
Return type: Dict[str, Any]

Examples

>>> dataset = Data.from_dict({
...     "cell1": {"TP53": -1},
...     "cell2": {"TP53": 0, "BRCA1": 0.5}
... })
>>> result = dataset.get_feature_across_samples("TP53")
>>> result
{'cell1': -1, 'cell2': 0}
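
"From all samples that contain it" means samples lacking the feature are left out of the result. A plain-dictionary sketch of that lookup, using a hypothetical helper `feature_across_samples` that is not part of corneto:

```python
from typing import Any, Dict

Nested = Dict[str, Dict[str, Any]]


def feature_across_samples(data: Nested, feature_name: str) -> Dict[str, Any]:
    """Map each sample ID to the feature's value, skipping samples without it."""
    return {
        sample_id: features[feature_name]
        for sample_id, features in data.items()
        if feature_name in features
    }


raw = {"cell1": {"TP53": -1}, "cell2": {"TP53": 0, "BRCA1": 0.5}}
tp53 = feature_across_samples(raw, "TP53")
brca1 = feature_across_samples(raw, "BRCA1")
print(tp53)   # {'cell1': -1, 'cell2': 0}
print(brca1)  # {'cell2': 0.5}
```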

merge(other)

Merge another Data instance into this one.

For samples that exist in both datasets, update the features with those from the other dataset. For new samples, simply add them to the current dataset.

Parameters: other (Data) – Another Data instance to merge into this one.
Return type: None

Examples

>>> ds1 = Data.from_dict({"cell1": {"TP53": -1}})
>>> ds2 = Data.from_dict({"cell1": {"BRCA1": 0.5}, "cell2": {"TP53": 0}})
>>> ds1.merge(ds2)
>>> ds1.data["cell1"].features
{'TP53': -1, 'BRCA1': 0.5}
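
The two merge rules (update shared samples, add new ones) can be sketched on plain dictionaries. `merge_into` is a hypothetical helper, not corneto code; it assumes that when both datasets define the same feature for a sample, the value from the other dataset wins, which is how dict.update behaves.

```python
from typing import Any, Dict

Nested = Dict[str, Dict[str, Any]]


def merge_into(target: Nested, other: Nested) -> None:
    """Merge other into target in place, mirroring the None return type."""
    for sample_id, features in other.items():
        # Existing samples are updated; unknown samples start from an empty dict.
        target.setdefault(sample_id, {}).update(features)


ds1 = {"cell1": {"TP53": -1}}
ds2 = {"cell1": {"TP53": -2, "BRCA1": 0.5}, "cell2": {"TP53": 0}}
merge_into(ds1, ds2)
print(ds1["cell1"])  # {'TP53': -2, 'BRCA1': 0.5}
print(ds1["cell2"])  # {'TP53': 0}
```

Because the merge happens in place, make a copy of the target first if the original must be preserved.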

filter_samples(predicate)

Filter samples based on a predicate function.

Parameters: predicate (Callable[[str, Sample], bool]) – A function that takes a sample ID and Sample, returning True if the sample should be included.
Returns: A new Data instance containing only the samples that satisfy the predicate.
Return type: Data

Examples

>>> dataset = Data.from_dict({
...     "cell1": {"TP53": -1},
...     "cell2": {"TP53": 0}
... })
>>> filtered = dataset.filter_samples(lambda sid, s: s.features.get("TP53", 0) < 0)
>>> list(filtered.data.keys())
['cell1']

collect_features(metadata_key, metadata_value, *, value_key='value', by_sample=False, return_values=False)

Collect feature names or values across samples that have a specific metadata key-value pair.

This method checks each sample for features where the metadata specified by metadata_key equals metadata_value. It can return a flattened set or a dictionary keyed by sample ID.

Parameters:

metadata_key (str) – The metadata key to filter features.
metadata_value (Any) – The metadata value to match.
value_key (str, optional) – The key used to extract feature values. Defaults to “value”.
by_sample (bool, optional) – If True, returns a dict mapping sample IDs to sets of features. If False, returns a flattened set of features across all samples. Defaults to False.
return_values (bool, optional) – If True, collects the feature values; otherwise collects feature names. Defaults to False.

Returns:

Either a set of collected features or a dictionary mapping sample IDs to sets of features, depending on the by_sample flag.

Return type:

Union[Set[Any], Dict[str, Set[Any]]]

Examples

>>> dataset = Data.from_dict({
...     "cell1": {"TP53": {"value": -1, "type": "tumor_suppressor"}},
...     "cell2": {"KRAS": {"value": 1, "type": "oncogene"}},
...     "cell3": {"BRCA1": {"value": 0.5, "type": "tumor_suppressor"}}
... })
>>> collected = dataset.collect_features("type", "tumor_suppressor", return_values=True)
>>> isinstance(collected, set)
True
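
The interaction of by_sample and return_values is easier to follow on a concrete case. The following plain-dictionary sketch imitates the documented flags; `collect` is a hypothetical helper, not the library's implementation, and leaving samples without a match out of the by_sample result is an assumption.

```python
from typing import Any, Dict, Set, Union

Nested = Dict[str, Dict[str, Any]]


def collect(
    data: Nested,
    metadata_key: str,
    metadata_value: Any,
    *,
    value_key: str = "value",
    by_sample: bool = False,
    return_values: bool = False,
) -> Union[Set[Any], Dict[str, Set[Any]]]:
    """Collect names (or values) of features whose metadata matches."""
    per_sample: Dict[str, Set[Any]] = {}
    for sample_id, features in data.items():
        hits = {
            entry.get(value_key) if return_values else name
            for name, entry in features.items()
            if isinstance(entry, dict) and entry.get(metadata_key) == metadata_value
        }
        if hits:  # assumption: samples without a match are omitted
            per_sample[sample_id] = hits
    if by_sample:
        return per_sample
    return set().union(*per_sample.values())  # flatten across samples


raw = {
    "cell1": {"TP53": {"value": -1, "type": "tumor_suppressor"}},
    "cell2": {"KRAS": {"value": 1, "type": "oncogene"}},
    "cell3": {"BRCA1": {"value": 0.5, "type": "tumor_suppressor"}},
}
names = collect(raw, "type", "tumor_suppressor")
values = collect(raw, "type", "tumor_suppressor", return_values=True)
grouped = collect(raw, "type", "tumor_suppressor", by_sample=True)
print(sorted(names))   # ['BRCA1', 'TP53']
print(sorted(values))  # [-1, 0.5]
print(grouped)         # {'cell1': {'TP53'}, 'cell3': {'BRCA1'}}
```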

copy()

Create a shallow copy of the dataset.

Each sample’s features are shallow-copied, so nested values such as metadata dictionaries remain shared with the original. Make a deep copy yourself if fully independent data is needed.

Returns: A new instance of Data with copied samples.
Return type: Data

Examples

>>> dataset = Data.from_dict({"cell1": {"TP53": -1}})
>>> new_dataset = dataset.copy()
>>> new_dataset.data["cell1"].features == dataset.data["cell1"].features
True
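
What a shallow copy of the features means in practice can be shown with plain dictionaries and the standard copy module. This is a generic Python illustration, not corneto code: the feature mapping of each sample is copied one level deep, so nested dictionaries are still shared.

```python
import copy

original = {"cell1": {"TP53": {"value": -1, "confidence": 0.9}}}

# Copy each sample's feature mapping one level deep (shallow).
shallow = {sample_id: dict(features) for sample_id, features in original.items()}
# Fully independent copy for comparison.
deep = copy.deepcopy(original)

# Adding a feature to the copy does not touch the original mapping.
shallow["cell1"]["BRCA1"] = 0.5
print("BRCA1" in original["cell1"])  # False

# Mutating a nested dict is visible through the original, because it is shared.
shallow["cell1"]["TP53"]["confidence"] = 0.1
print(original["cell1"]["TP53"]["confidence"])  # 0.1
print(deep["cell1"]["TP53"]["confidence"])      # 0.9
```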