Dataset Caching
The base dataset class BaseDataset
defined in this library supports basic caching functionality.
The corresponding cache handles are defined in the caching module.
About Caches
The base class of cache handles is Cache.
A cache implementation provides the methods
put() and
load(),
which store a given object and load a previously stored object, respectively, using a
given descriptor. If load() is called with a descriptor for which no object
has been stored so far, None is returned.
The following example uses a file cache that stores objects on disk using torch.save():
>>> import os, torch
>>> from hybrid_learning.datasets.caching import PTCache
>>> mycache = PTCache(cache_root=".pytest_tmpdir")
>>> obj: torch.Tensor = torch.tensor([1., 2., 3.])
>>> descriptor: str = "unique_descriptor"
>>> mycache.put(descriptor, obj)
>>> assert os.path.exists(os.path.join(mycache.cache_root, descriptor + ".pt"))
>>> print(mycache.load(descriptor))
tensor([1., 2., 3.])
>>> print(mycache.load("descriptor_of_not_yet_stored_object"))
None
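To make the put()/load() contract concrete without any disk access, here is a minimal sketch of an in-memory cache. The class name DictCache is hypothetical and not part of the library; it only mimics the behavior described above, including the None result for unknown descriptors.

```python
from typing import Any, Dict, Hashable, Optional


class DictCache:
    """Hypothetical in-memory cache mimicking the put()/load() contract."""

    def __init__(self) -> None:
        self._storage: Dict[Hashable, Any] = {}

    def put(self, descriptor: Hashable, obj: Any) -> None:
        # Store obj under its descriptor, overwriting any previous entry.
        self._storage[descriptor] = obj

    def load(self, descriptor: Hashable) -> Optional[Any]:
        # Return the stored object, or None if nothing was stored so far.
        return self._storage.get(descriptor)


cache = DictCache()
cache.put("unique_descriptor", [1, 2, 3])
hit = cache.load("unique_descriptor")
miss = cache.load("descriptor_of_not_yet_stored_object")
```

One consequence of this contract is that None itself cannot be cached in a distinguishable way: a stored None looks the same as a cache miss.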
Adding a Cache to a Dataset
Implementations of BaseDataset
allow specifying a cache handle in order to cache transformed items.
They feature a descriptor()
method that returns the unique descriptor of the sample at an index.
If the dataset is assigned a transforms_cache
handle, these descriptors are used to load or put a transformed sample into the cache.
To apply further transformations to items, regardless of whether they were
loaded from the cache or newly transformed using transforms,
use after_cache_transforms().
>>> from hybrid_learning.datasets.custom import coco
>>> concept_data = coco.ConceptDataset(
... body_parts=[coco.BodyParts.FACE],
... dataset_root=os.path.join("dataset", "coco_test", "images", "train2017"),
... transforms_cache=mycache
... )
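The lookup order described above can be sketched with a small stand-alone dataset. This is an illustration under stated assumptions, not the actual BaseDataset implementation: TinyDataset and its transform_calls counter are hypothetical, and a plain dict stands in for the cache handle.

```python
from typing import Any, Callable, Dict, List, Optional


class TinyDataset:
    """Hypothetical dataset illustrating the cache lookup order."""

    def __init__(self, items: List[int],
                 transforms: Callable[[int], Any],
                 transforms_cache: Optional[Dict[str, Any]] = None,
                 after_cache_transforms: Optional[Callable[[Any], Any]] = None):
        self.items = items
        self.transforms = transforms
        # Assumption: a plain dict stands in for a real cache handle.
        self.transforms_cache = transforms_cache
        self.after_cache_transforms = after_cache_transforms
        self.transform_calls = 0  # counts how often transforms actually ran

    def descriptor(self, i: int) -> str:
        # Unique descriptor of the sample at index i.
        return f"item_{i}"

    def __getitem__(self, i: int) -> Any:
        desc = self.descriptor(i)
        item = None
        if self.transforms_cache is not None:
            item = self.transforms_cache.get(desc)  # try the cache first
        if item is None:
            # Cache miss: transform the raw item and store the result.
            self.transform_calls += 1
            item = self.transforms(self.items[i])
            if self.transforms_cache is not None:
                self.transforms_cache[desc] = item
        # Applied in both cases, cache hit or fresh transformation.
        if self.after_cache_transforms is not None:
            item = self.after_cache_transforms(item)
        return item


data = TinyDataset([10, 20], transforms=lambda x: x * 2,
                   transforms_cache={},
                   after_cache_transforms=lambda x: x + 1)
first = data[0]   # transformed, cached, then post-processed
second = data[0]  # loaded from cache, then post-processed
```

Note that only the output of transforms is cached in this sketch, so changing after_cache_transforms later does not invalidate cached entries.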
Also, caching can be applied manually by decorating
__getitem__-like functions with a cache’s
wrap() method.
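The idea behind such a decorator can be sketched as follows. This is a hypothetical stand-alone version: the DictCache class is not part of the library, the index itself is used as the descriptor, and the signature of the library's own wrap() may differ.

```python
import functools
from typing import Any, Callable, Dict, List


class DictCache:
    """Hypothetical in-memory cache with a wrap() decorator."""

    def __init__(self) -> None:
        self._storage: Dict[int, Any] = {}

    def put(self, descriptor: int, obj: Any) -> None:
        self._storage[descriptor] = obj

    def load(self, descriptor: int) -> Any:
        return self._storage.get(descriptor)

    def wrap(self, getitem: Callable[[int], Any]) -> Callable[[int], Any]:
        # Assumption: the index serves directly as the descriptor.
        @functools.wraps(getitem)
        def cached_getitem(i: int) -> Any:
            obj = self.load(i)
            if obj is None:
                # Cache miss: compute the item and store it.
                obj = getitem(i)
                self.put(i, obj)
            return obj
        return cached_getitem


calls: List[int] = []  # records which indices were actually computed
cache = DictCache()


@cache.wrap
def getitem(i: int) -> int:
    calls.append(i)
    return i ** 2


results = [getitem(3), getitem(3), getitem(4)]
```

The second call with index 3 is served from the cache, so the wrapped function runs only once per distinct index.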