Dataset Caching
The base dataset class BaseDataset defined in this library supports basic caching functionality. The corresponding cache handles are defined in caching.
About Caches
The base class of cache handles is Cache. A cache implementation provides the methods put() and load(), which respectively store a given object under a given descriptor and load a previously stored object by its descriptor. If load() is called with a descriptor for which no object has been stored so far, None is returned.
The following example uses a file cache that stores objects to disk via torch.save():
>>> import os, torch
>>> from hybrid_learning.datasets.caching import PTCache
>>> mycache = PTCache(cache_root=".pytest_tmpdir")
>>> obj: torch.Tensor = torch.tensor([1,2,3])
>>> descriptor: str = "unique_descriptor"
>>> mycache.put(descriptor, obj)
>>> assert os.path.exists(os.path.join(mycache.cache_root, descriptor + ".pt"))
>>> print(mycache.load(descriptor))
tensor([1., 2., 3.])
>>> print(mycache.load("descriptor_of_not_yet_stored_object"))
None
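To illustrate the put()/load() contract without touching disk, here is a minimal hypothetical in-memory cache that follows the same protocol. DictCache and its internals are illustrative assumptions, not part of the library:

```python
from typing import Any, Dict, Optional

class DictCache:
    """Hypothetical in-memory cache following the put()/load() protocol."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def put(self, descriptor: str, obj: Any) -> None:
        """Store obj under the given descriptor."""
        self._store[descriptor] = obj

    def load(self, descriptor: str) -> Optional[Any]:
        """Return the object stored under descriptor, or None if absent."""
        return self._store.get(descriptor)

cache = DictCache()
cache.put("sample_0", [1, 2, 3])
print(cache.load("sample_0"))   # [1, 2, 3]
print(cache.load("missing"))    # None
```

Any object implementing this pair of methods can serve as a cache handle; the choice of backend (memory, disk, database) is transparent to the caller.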
Adding a Cache to a Dataset
Implementations of BaseDataset allow specifying a cache handle in order to cache transformed items. They feature a descriptor() method that returns the unique descriptor of the sample at a given index. If the dataset is assigned a transforms_cache handle, these descriptors are used to load transformed samples from, or put them into, the cache. To apply further transformations to items, regardless of whether they were loaded from the cache or newly transformed via transforms, use after_cache_transforms().
>>> from hybrid_learning.datasets.custom import coco
>>> concept_data = coco.ConceptDataset(
... body_parts=[coco.BodyParts.FACE],
... dataset_root=os.path.join("dataset", "coco_test", "images", "train2017"),
... transforms_cache=mycache
... )
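The per-item lookup logic described above can be sketched as follows. This is a self-contained illustration of the pattern, assuming a plain dict in place of a Cache handle; class and attribute names are hypothetical stand-ins, not the library's actual implementation:

```python
from typing import Any, Callable, Dict, List, Optional

class CachedDataset:
    """Sketch of per-item cache lookup in a dataset (illustrative only)."""

    def __init__(self, raw_items: List[Any],
                 transforms: Callable[[Any], Any],
                 after_cache_transforms: Optional[Callable[[Any], Any]] = None):
        self.raw_items = raw_items
        self.transforms = transforms
        self.after_cache_transforms = after_cache_transforms
        self._cache: Dict[str, Any] = {}  # stands in for a cache handle
        self.transform_calls = 0          # counts expensive transform runs

    def descriptor(self, idx: int) -> str:
        """Unique descriptor of the sample at idx."""
        return f"sample_{idx}"

    def __getitem__(self, idx: int) -> Any:
        desc = self.descriptor(idx)
        item = self._cache.get(desc)       # try the cache first
        if item is None:                   # cache miss: transform and store
            self.transform_calls += 1
            item = self.transforms(self.raw_items[idx])
            self._cache[desc] = item
        if self.after_cache_transforms:    # applied on hits and misses alike
            item = self.after_cache_transforms(item)
        return item

ds = CachedDataset([1, 2, 3], transforms=lambda x: x * 10,
                   after_cache_transforms=lambda x: x + 1)
print(ds[0], ds[0])        # 11 11 -- second access is served from the cache
print(ds.transform_calls)  # 1
```

Note that after_cache_transforms runs on every access, while transforms runs only on cache misses, which is why non-deterministic or epoch-dependent augmentations belong in the former.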
Caching can also be applied manually by decorating __getitem__-like functions with a cache's wrap() method.
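The decorating pattern behind wrap() can be sketched as below. The actual signature belongs to the library's Cache class; MemoryCache here is a hypothetical stand-in that caches the results of a __getitem__-like function by its index argument:

```python
from typing import Any, Callable, Dict

class MemoryCache:
    """Hypothetical in-memory cache sketching a wrap() decorator."""

    def __init__(self) -> None:
        self._store: Dict[Any, Any] = {}

    def wrap(self, getitem_fn: Callable[[int], Any]) -> Callable[[int], Any]:
        """Return a version of getitem_fn that consults the cache first."""
        def cached_getitem(idx: int) -> Any:
            if idx not in self._store:            # cache miss: compute and store
                self._store[idx] = getitem_fn(idx)
            return self._store[idx]
        return cached_getitem

calls = []
def expensive_getitem(i: int) -> int:
    calls.append(i)  # record each invocation of the uncached function
    return i * i

cached = MemoryCache().wrap(expensive_getitem)
print(cached(3), cached(3))  # 9 9
print(calls)                 # [3] -- the second call hit the cache
```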