This page has been completely rewritten from its original form, and the terminology has changed. In particular, because we've settled on an approach in which all datasets always have an entry in the Registry's Dataset table, we have repurposed "virtual" to mean something different.
I'll use the Exposure StorageClass for most examples. For the purposes of this page, we'll assume it has the following simplified definition (pseudocode; not actual APIs):
    Wcs = StorageClass(...)
    Image = StorageClass(...)
    MaskedImage = StorageClass(
        components={
            "image": Image,
            "variance": Image
        },
        ...
    )
    Exposure = StorageClass(
        components={
            "maskedImage": MaskedImage,
            "wcs": Wcs,
        },
        ...
    )
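For readers who want something they can actually execute, here is a minimal runnable sketch of the same nesting. The StorageClass dataclass and its is_composite helper are hypothetical stand-ins invented for this page, not the real API; the only thing they model is the mapping from component names to child StorageClasses.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StorageClass:
    """Hypothetical stand-in: a name plus a mapping of named components."""
    name: str
    components: Dict[str, "StorageClass"] = field(default_factory=dict)

    def is_composite(self) -> bool:
        # A StorageClass is a composite if it defines at least one component.
        return bool(self.components)


Wcs = StorageClass("Wcs")
Image = StorageClass("Image")
MaskedImage = StorageClass("MaskedImage", {"image": Image, "variance": Image})
Exposure = StorageClass("Exposure", {"maskedImage": MaskedImage, "wcs": Wcs})
```

Note that MaskedImage is both a component (of Exposure) and a composite (of two Images), which is why the glossary below treats those as independent properties.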
composite: a Dataset or DatasetType whose StorageClass defines a set of discrete named child datasets, called components
parent: synonym for composite
component: a Dataset or DatasetType that may be accessed as a child of a composite (in some cases may also be accessed in other ways)
child: synonym for component
virtual: a Dataset or DatasetType that is defined by its relationship to one or more other Datasets/DatasetTypes.
concrete: not virtual
immediate: all content is written in a single call to Butler.put
deferred: content is written via multiple calls to Butler.put (possibly in different SuperTasks) and associated later (possibly in yet another SuperTask) via a call to Butler.link.
This leads to four fundamental kinds of Dataset[Type]s:
[...] Datastore.put call and read by a single Datastore.get call. Writing a concrete composite also creates virtual components. Examples: a WCS written on its own, an Exposure written all at once into a single file [1], or a Wcs written to its own file when writing an Exposure to multiple files.
[1] I'm saying "file" rather than "Dataset" here (and below) both to provide a clearer example and because I think concrete composite vs. immediate virtual composites is how we'll want to implement single-file Exposure vs. multi-file Exposure. This design doesn't actually guarantee that a Datastore will write a concrete composite as a single file (after all, it could even write some/all of it to a SQL database instead), but Datastores that actually do write files shouldn't need to be able to split up concrete composites into multiple files themselves. The design does guarantee that an immediate virtual composite will not be written as a single file.
[...] link as well as put to create it.
When a DatasetType with a composite StorageClass is declared to a Registry, DatasetTypes for each of the named components are also declared, with names constructed as "{parent-dataset-type-name}.{component-name}", the same DataUnits types as the parent, and StorageClasses that are the same as what would be used to write the component as a standalone Dataset.
For example (pseudocode):
    # Given the StorageClasses defined above, this line:
    registry.registerDatasetType(CalExp=DatasetType(StorageClass=Exposure, DataUnits=("Visit", "Sensor")))
    # implies:
    # registry.registerDatasetType(CalExp.wcs=DatasetType(StorageClass=Wcs, DataUnits=("Visit", "Sensor")))
    # registry.registerDatasetType(CalExp.maskedImage=DatasetType(StorageClass=MaskedImage, DataUnits=("Visit", "Sensor")))
As we'll see below, these component DatasetTypes will be used by virtual components of concrete composites and concrete components of immediate virtual composites, but will not be used for components of deferred virtual composites (because those will have already been written and added to the Registry using some other DatasetType).
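The implied registrations can be sketched as runnable code. Everything here (ToyRegistry, register_dataset_type, and the simplified StorageClass and DatasetType dataclasses) is a hypothetical stand-in, not the real Registry API. I'm also assuming that only the direct components of the declared DatasetType are expanded; whether nested components such as the Images inside MaskedImage get their own DatasetTypes is not something this page specifies.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class StorageClass:
    name: str
    components: Dict[str, "StorageClass"] = field(default_factory=dict)


@dataclass
class DatasetType:
    name: str
    storage_class: StorageClass
    data_units: Tuple[str, ...]


class ToyRegistry:
    """Hypothetical in-memory registry holding only DatasetType declarations."""

    def __init__(self) -> None:
        self.dataset_types: Dict[str, DatasetType] = {}

    def register_dataset_type(self, dataset_type: DatasetType) -> None:
        self.dataset_types[dataset_type.name] = dataset_type
        # Declare one DatasetType per direct component, named
        # "{parent-dataset-type-name}.{component-name}", with the parent's
        # DataUnits and the component's own StorageClass.
        for comp_name, comp_sc in dataset_type.storage_class.components.items():
            child = DatasetType(f"{dataset_type.name}.{comp_name}", comp_sc,
                                dataset_type.data_units)
            self.dataset_types[child.name] = child


Wcs = StorageClass("Wcs")
Image = StorageClass("Image")
MaskedImage = StorageClass("MaskedImage", {"image": Image, "variance": Image})
Exposure = StorageClass("Exposure", {"maskedImage": MaskedImage, "wcs": Wcs})

registry = ToyRegistry()
registry.register_dataset_type(DatasetType("CalExp", Exposure, ("Visit", "Sensor")))
```

After the single call at the bottom, the registry holds CalExp, CalExp.maskedImage and CalExp.wcs, all with the DataUnits ("Visit", "Sensor").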
Pseudocode, no error-handling/transactions:
    def Butler.put(self, obj, datasetType, dataId, producer=None):
        """Write a dataset.

        May not be a virtual component or a deferred virtual composite.
        """
        datasetType = self.registry.getDatasetType(datasetType)  # argument may have just been a string; now it's an object
        ref = self.registry.addDataset(datasetType, dataId, run=self.run, producer=producer)
        disassembler = self.config.getDisassembler(datasetType)
        if disassembler is not None:
            # this is an immediate virtual composite
            childObjs = disassembler(obj)  # assumed to return a dict keyed by component name
            for childName, childDatasetType in datasetType.components.items():
                childRef = self.put(childObjs[childName], childDatasetType, dataId, producer=producer)
                self.registry.attachComponent(parent=ref, child=childRef)
            self.registry.setAssembler(ref, self.config.getAssembler(datasetType))  # Could also consider putting this in Datastore
        else:
            # this is concrete (and maybe a composite)
            for childName, childDatasetType in datasetType.components.items():
                # if not a composite, loop body is never executed
                childRef = self.registry.addDataset(childDatasetType, dataId, run=self.run, producer=producer)
                self.registry.attachComponent(parent=ref, child=childRef)
                self.datastore.addReader(childRef, self.config.getReadFormatter(childDatasetType))
            self.datastore.put(obj, ref)
        return ref

    def Butler.link(self, datasetType, childRefs, dataId, producer=None):
        """Create a deferred virtual composite dataset by associating existing datasets.

        There are two link overloads; this one is more powerful but less convenient in the common case.
        """
        ref = self.registry.addDataset(datasetType, dataId, run=self.run, producer=producer)
        for childRef in childRefs:
            self.registry.attachComponent(ref, childRef)
        self.registry.setAssembler(ref, self.config.getAssembler(datasetType))
        return ref

    def Butler.link(self, datasetType, childDatasetTypes, dataId, producer=None):
        """Create a deferred virtual composite dataset by associating existing datasets.

        There are two link overloads; this one is less powerful but more convenient in the common case.
        """
        # Look up the DatasetRefs using the given DataID and then call the other overload.
        return self.link(datasetType,
                         [self.registry.find(childDatasetType, dataId) for childDatasetType in childDatasetTypes],
                         dataId, producer=producer)

    def Registry.addDataset(self, datasetType, dataId, run, producer=None):
        dataset_id, registry_id = self.execute("INSERT INTO Dataset ...")
        return DatasetRef(datasetType, dataId, dataset_id, registry_id, ...)

    def Registry.attachComponent(self, parent, child):
        self.execute("INSERT INTO DatasetComposition (parent_dataset_id, parent_registry_id, component_name, child_dataset_id, child_registry_id) ...")

    def Registry.setAssembler(self, ref, assembler):
        self.execute("UPDATE Dataset SET assembler=? WHERE dataset_id=? AND registry_id=?",
                     assembler.name, ref.dataset_id, ref.registry_id)

    def Datastore.put(self, obj, ref):
        # ... actually write a file or something ...
        self.registry.execute("INSERT INTO Storage (dataset_id, registry_id, datastore_name, md4, size) ...")
        self.addReader(ref, self.config.getReadFormatter(ref.datasetType))

    def Datastore.addReader(self, ref, formatter):
        # ... record somewhere that we should use the given formatter when asked to read back ref ...
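To make the three write paths easier to compare, here is a runnable in-memory sketch. ToyButler and all of its attributes are hypothetical names invented for this example; plain dicts stand in for the Dataset, DatasetComposition and Storage tables, and a set stands in for the assembler column. It assumes a disassembler returns a dict keyed by component name, and that link receives its children as a {component name: dataset id} mapping. It also ignores nested composites, run, and producer.

```python
import itertools


class ToyButler:
    """In-memory stand-in; none of these names are real Butler APIs."""

    def __init__(self, components, disassemblers):
        self.components = components        # dataset type name -> component names
        self.disassemblers = disassemblers  # dataset type name -> callable(obj) -> dict
        self.datasets = {}      # dataset id -> (dataset type name, data ID)
        self.composition = {}   # parent dataset id -> {component name: child id}
        self.assembled = set()  # ids of virtual composites (those with an assembler)
        self.storage = {}       # dataset id -> object actually written
        self._ids = itertools.count(1)

    def _add_dataset(self, type_name, data_id):
        dataset_id = next(self._ids)
        self.datasets[dataset_id] = (type_name, data_id)
        return dataset_id

    def put(self, obj, type_name, data_id):
        ref = self._add_dataset(type_name, data_id)
        names = self.components.get(type_name, [])
        disassembler = self.disassemblers.get(type_name)
        if disassembler is not None:
            # Immediate virtual composite: every component is written on its own.
            children = disassembler(obj)
            self.composition[ref] = {
                name: self.put(children[name], f"{type_name}.{name}", data_id)
                for name in names
            }
            self.assembled.add(ref)
        else:
            # Concrete: one write; components get registry entries but no storage.
            self.composition[ref] = {
                name: self._add_dataset(f"{type_name}.{name}", data_id)
                for name in names
            }
            self.storage[ref] = obj
        return ref

    def link(self, type_name, child_refs, data_id):
        # Deferred virtual composite: associate datasets that already exist.
        ref = self._add_dataset(type_name, data_id)
        self.composition[ref] = dict(child_refs)
        self.assembled.add(ref)
        return ref


data_id = {"Visit": 1, "Sensor": 2}
exposure = {"maskedImage": "pixels", "wcs": "tan"}

# Multi-file style: a disassembler is configured for CalExp.
split = ToyButler({"CalExp": ["maskedImage", "wcs"]}, {"CalExp": lambda exp: dict(exp)})
parent = split.put(exposure, "CalExp", data_id)

# Single-file style: no disassembler, so CalExp is a concrete composite.
whole = ToyButler({"CalExp": ["maskedImage", "wcs"]}, {})
single = whole.put(exposure, "CalExp", data_id)

# Deferred: a Wcs written under a made-up DatasetType, linked into a composite later.
wcs_ref = whole.put("tan", "AstrometryWcs", data_id)
linked = whole.link("LinkedExp", {"wcs": wcs_ref}, data_id)
```

In the split case only the children end up in storage; in the single-file case only the parent does, and its components exist purely as registry entries (the virtual components described earlier).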
Again, just pseudocode, no error handling:
    def Butler.get(self, datasetType, dataId):
        ref = self.registry.find(datasetType, dataId)
        if ref.assembler is not None:
            # virtual composite
            return ref.assembler({childName: self.datastore.get(childRef.dataset_id, childRef.registry_id)
                                  for childName, childRef in ref.components.items()})
        else:
            # virtual component or concrete
            return self.datastore.get(ref.dataset_id, ref.registry_id)

    def Registry.find(self, datasetType, dataId):
        ref = DatasetRef(self.execute("SELECT * FROM Dataset WHERE dataset_type_name=? AND ...", datasetType.name, dataId))
        if ref is None:
            # If this is a component of a deferred virtual composite, the Dataset table will have its original
            # dataset_type_name and data ID, and the above query will fail. Instead we look for that composite
            # and then find the component by name.
            # Split the part before the first "." from the part after it.
            parentDatasetTypeName, componentName = datasetType.name.split(".", 1)
            datasetEntry = self.execute("""
                SELECT Child.*
                FROM Dataset AS Child
                    INNER JOIN DatasetComposition ON (...)
                    INNER JOIN Dataset AS Parent ON (...)
                WHERE DatasetComposition.component_name=?
                    AND Parent.dataset_type_name=?
                    AND Parent.{data_unit_fields}=...
                """,
                componentName, parentDatasetTypeName, dataId
            )
            ref = DatasetRef(datasetEntry)  # use the component found via its parent
        if datasetType.components:
            ref.components = {
                result["component_name"]: DatasetRef(result)
                for result in self.execute("""
                    SELECT DatasetComposition.component_name, Child.*
                    FROM Dataset AS Child
                        INNER JOIN DatasetComposition ON (...)
                    WHERE DatasetComposition.parent_dataset_id=?
                        AND DatasetComposition.parent_registry_id=?
                    """,
                    ref.dataset_id, ref.registry_id
                )
            }
        return ref
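The two-step lookup in Registry.find is the least obvious part, so here is a runnable sketch of just that logic. ToyRegistry, add, and the DatasetType names AstrometryWcs and PostIsrImage are all hypothetical; dicts replace the Dataset and DatasetComposition tables, and the data ID is compared as a sorted tuple of its items.

```python
class ToyRegistry:
    """In-memory stand-in for the lookup logic only; not a real Registry API."""

    def __init__(self):
        self.datasets = {}     # dataset id -> (dataset type name, data ID key)
        self.composition = {}  # parent dataset id -> {component name: child id}

    @staticmethod
    def _key(data_id):
        # Dicts are not hashable or order-stable keys, so normalize the data ID.
        return tuple(sorted(data_id.items()))

    def add(self, dataset_id, type_name, data_id):
        self.datasets[dataset_id] = (type_name, self._key(data_id))

    def find(self, type_name, data_id):
        # Step 1: direct match on dataset type name and data ID.
        wanted = (type_name, self._key(data_id))
        for dataset_id, entry in self.datasets.items():
            if entry == wanted:
                return dataset_id
        # Step 2: treat the name as "{parent}.{component}", find the parent,
        # then resolve the component by name through the composition table.
        parent_name, sep, component_name = type_name.partition(".")
        if not sep:
            return None  # not a component name, so there is nothing to fall back to
        parent_id = self.find(parent_name, data_id)
        if parent_id is None:
            return None
        return self.composition.get(parent_id, {}).get(component_name)


data_id = {"Visit": 1, "Sensor": 2}
registry = ToyRegistry()
registry.add(1, "AstrometryWcs", data_id)  # written earlier under its own DatasetType
registry.add(2, "PostIsrImage", data_id)   # likewise
registry.add(3, "CalExp", data_id)         # deferred virtual composite
registry.composition[3] = {"wcs": 1, "maskedImage": 2}

via_parent = registry.find("CalExp.wcs", data_id)
direct = registry.find("AstrometryWcs", data_id)
missing = registry.find("CalExp.psf", data_id)
```

Here via_parent and direct resolve to the same dataset, which is the point of the fallback: the component of a deferred virtual composite keeps its original Dataset entry and is only reachable under the "CalExp.wcs" name through its parent.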