...

Code Block
languagepy
Wcs = StorageClass(...)

Image = StorageClass(...)

MaskedImage = StorageClass(
    components={
        "image": Image,
        "variance": Image
    },
    ...
)

Exposure = StorageClass(
    components={
        "maskedImage": MaskedImage,
        "wcs": Wcs,
        "image": "maskedImage.image",        # aliases to more deeply-nested components
        "variance": "maskedImage.variance"
    },
    ...
)

Terminology

composite: a Dataset or DatasetType whose StorageClass defines a set of discrete named child datasets, called components

...

deferred: content is written via multiple calls to Butler.put (possibly in different SuperTasks) and associated later (possibly in yet another SuperTask) via a call to Butler.link.

...

  • All Datasets have an entry in the Registry's Dataset table.  This implies that a composite Dataset is more than the sum of its components: it also includes (Registry) information to associate them.
  • Any provenance graph (both what's recorded after processing and the QuantumGraph produced by Preflight) must contain nodes for both composite and components, because:
    • a SuperTask may consume only some components of a composite, so all component nodes must be in the graph;
    • a deferred virtual composite must be created explicitly, so it cannot be considered implicit in the graph;
    • whether a particular composite DatasetType is defined as concrete, virtual, immediate, or deferred should not affect what processing is done, so we cannot include just some composite nodes in the graph.
  • All information needed to read a Dataset is saved at the level of the Dataset (in some combination of Registry or Datastore).  No information necessary for reading a Dataset is stored at a Datastore-wide or Registry-wide level, and it should never be necessary to configure a Butler a certain way in order to read something.
  • It must be possible to change whether a particular DatasetType is written as a concrete composite or an immediate virtual composite by changing only the Butler/Datastore configuration provided when initializing a client.  The same is true for configuring a Dataset to be a deferred virtual composite, though of course this also requires Butler users (e.g. SuperTasks) to use link as well as put to create it.
    • As a result, any composite StorageClass must be writeable as concrete (with virtual components), immediate virtual, or deferred virtual; the StorageClass itself shall not be specialized to one of these choices.
    • It should not be necessary to change persistent Registry or Datastore state (e.g. database tables or config files stored with the Datastore) in order to change how a composite DatasetType is written.  It should be sufficient to change Butler/Datastore client information, though of course some configurations may be rejected by certain Datastores, and Datastores may provide persistent defaults for that configuration.

Permitted Combinations

  1. A virtual component must be a part of exactly one concrete composite.  For example, if Exposure A and Exposure B are each written as a single file, then A.wcs cannot also be a component of B.
  2. A virtual component may be a part of one or more deferred virtual composites.  For example, if Exposure A and Exposure B are each written as a single file, then an Exposure C may be defined such that C.wcs = A.wcs and C.maskedImage = B.maskedImage.
  3. A virtual component may be a part of at most one immediate virtual composite, but only indirectly: it must be a component of a concrete composite that is in turn a component of the immediate virtual composite.  For example, if Exposure D is an immediate virtual composite, and its maskedImage is concrete, writing D writes D.maskedImage to a single file (probably1), creating D.maskedImage.image, a virtual component.
  4. A concrete composite always contains virtual components.  For example, writing an Exposure A as a single file always implies that A.wcs is a valid dataset (though it may be permitted to be None/null).
  5. A concrete composite may not contain concrete components.
  6. A concrete composite may not contain virtual composites.
  7. An immediate virtual composite may contain one or more concrete components.  For example, if Exposure D is an immediate virtual composite, its maskedImage and wcs components will be written (when D is put) as separate files.
  8. An immediate virtual composite may contain one or more virtual components, but only indirectly (this is a restatement of (3)).
  9. An immediate virtual composite may contain other immediate virtual composites.  For example, if Exposure E is an immediate virtual composite, its maskedImage component may also be an immediate virtual composite, which means that E.maskedImage.image and E.maskedImage.variance will each be written as distinct concrete datasets (i.e. separate files) when E is put.
  10. An immediate virtual composite may not contain deferred virtual composites.  (Datasets are put or linked, but never both).
  11. A deferred virtual composite may contain one or more concrete components.  For example, we could write a stand-alone Wcs F, then later define an Exposure G such that G.wcs is F.
  12. A deferred virtual composite may contain one or more virtual components.  Those virtual components must still have concrete composite parents, but those concrete composite parents need not be children of the deferred virtual composite.  This is a restatement of (2).
  13. A deferred virtual composite may contain one or more immediate virtual composites.  For example, we could write a MaskedImage H as an immediate virtual composite, resulting in H.image and H.variance being written as separate files.  We could then define an Exposure J such that J.maskedImage is H.
  14. A deferred virtual composite may contain one or more other deferred virtual composites.  For example, we could write two Images K and L, then define a MaskedImage M such that M.image=K and M.variance=L, and then define an Exposure N such that N.maskedImage=M (and N.image=K and N.variance=L).
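
The direct-containment rules above can be condensed into a small lookup table. The following is only a sketch under stated assumptions: the `Kind` enum, the `PERMITTED` table, and `mayContain` are hypothetical names introduced for illustration, and the table covers only direct parent/child containment (the "exactly one" and "at most one" multiplicity constraints in rules 1 to 3 are not modelled).

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical labels for the kinds of Dataset discussed above."""
    CONCRETE = "concrete"                       # written as a single unit; may be a concrete composite
    VIRTUAL_COMPONENT = "virtual component"     # exists only as part of a concrete composite
    IMMEDIATE_VIRTUAL = "immediate virtual composite"
    DEFERRED_VIRTUAL = "deferred virtual composite"


# Which kinds a parent may contain *directly*, as read from the numbered rules.
PERMITTED = {
    # Rules 4-6: a concrete composite contains virtual components only.
    Kind.CONCRETE: {Kind.VIRTUAL_COMPONENT},
    # Rules 7 and 9 allow concrete and immediate virtual children; rule 10
    # forbids deferred virtual children; rule 8 allows virtual components
    # only indirectly, so they are not listed as direct children here.
    Kind.IMMEDIATE_VIRTUAL: {Kind.CONCRETE, Kind.IMMEDIATE_VIRTUAL},
    # Rules 11-14: a deferred virtual composite may contain any kind.
    Kind.DEFERRED_VIRTUAL: {
        Kind.CONCRETE,
        Kind.VIRTUAL_COMPONENT,
        Kind.IMMEDIATE_VIRTUAL,
        Kind.DEFERRED_VIRTUAL,
    },
    # The rules do not say whether a virtual component may itself have
    # children; this sketch assumes (without support from the text) that it does not.
    Kind.VIRTUAL_COMPONENT: set(),
}


def mayContain(parent, child):
    """Return True if `parent` may directly contain `child` in this sketch."""
    return child in PERMITTED[parent]
```

For example, `mayContain(Kind.IMMEDIATE_VIRTUAL, Kind.DEFERRED_VIRTUAL)` is false (rule 10), while `mayContain(Kind.DEFERRED_VIRTUAL, Kind.IMMEDIATE_VIRTUAL)` is true (rule 13).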

...

Code Block
languagepy
# Given the StorageClasses defined above, this line:
registry.registerDatasetType(CalExp=DatasetType(StorageClass=Exposure, DataUnits=("Visit", "Sensor")))
# implies:
# registry.registerDatasetType(CalExp.wcs=DatasetType(StorageClass=Wcs, DataUnits=("Visit", "Sensor")))
# registry.registerDatasetType(CalExp.maskedImage=DatasetType(StorageClass=MaskedImage, DataUnits=("Visit", "Sensor")))
# registry.registerDatasetType(CalExp.image=DatasetType(StorageClass=Image, DataUnits=("Visit", "Sensor")))
# registry.registerDatasetType(CalExp.variance=DatasetType(StorageClass=Image, DataUnits=("Visit", "Sensor")))
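
As a rough illustration of how registering a composite DatasetType could imply its component DatasetTypes, the sketch below expands a name and StorageClass into the implied entries. It is not the Registry implementation: the `COMPONENTS` dictionary, `resolve`, and `impliedDatasetTypes` are hypothetical, StorageClasses are reduced to plain strings, and dotted aliases are assumed to resolve by walking nested components.

```python
# Hypothetical, string-only mirror of the StorageClass definitions above:
# StorageClass name -> {component name: StorageClass name or dotted alias}.
COMPONENTS = {
    "Wcs": {},
    "Image": {},
    "MaskedImage": {"image": "Image", "variance": "Image"},
    "Exposure": {
        "maskedImage": "MaskedImage",
        "wcs": "Wcs",
        "image": "maskedImage.image",
        "variance": "maskedImage.variance",
    },
}


def resolve(storageClass, target):
    """Return the StorageClass name that a component target refers to."""
    if target in COMPONENTS:
        return target  # already a StorageClass name
    # Otherwise treat the target as a dotted alias and walk it from storageClass.
    current = storageClass
    for part in target.split("."):
        current = resolve(current, COMPONENTS[current][part])
    return current


def impliedDatasetTypes(name, storageClass):
    """Map each implied DatasetType name to its StorageClass name."""
    implied = {name: storageClass}
    for componentName, target in COMPONENTS[storageClass].items():
        implied[f"{name}.{componentName}"] = resolve(storageClass, target)
    return implied


calExpTypes = impliedDatasetTypes("CalExp", "Exposure")
```

Under these assumptions, `calExpTypes` holds `CalExp` itself plus `CalExp.maskedImage`, `CalExp.wcs`, `CalExp.image`, and `CalExp.variance`, with the two aliases resolving to `Image`.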

...

Code Block
languagepy
def Butler.put(self, obj, datasetType, dataId, producer=None):
    """Write a dataset.

    May not be a virtual component or a deferred virtual composite.
    """
    datasetType = self.registry.getDatasetType(datasetType)  # argument may have just been a string; now it's an object
    ref = self.registry.addDataset(datasetType, dataId, run=self.run, producer=producer)
    disassembler = self.config.getDisassembler(datasetType)
    if disassembler is not None:  # this is an immediate virtual composite
        childObjs = disassembler(obj)  # assumed to map component name -> child object
        for childName, childDatasetType in datasetType.components.items():
            if self.config.getWriteFormatter(childDatasetType):  # not a virtual component
                childRef = self.put(childObjs[childName], childDatasetType, dataId, producer=producer)
                self.registry.attachComponent(parent=ref, child=childRef)
            # virtual components are skipped here: they belong to this composite only
            # indirectly, through the concrete composite that contains them
        self.registry.setAssembler(ref, self.config.getAssembler(datasetType))  # Could also consider putting this in Datastore
    else:   # this is concrete (and maybe a composite)
        for childName, childDatasetType in datasetType.components.items():  # if not a composite, loop body is never executed
            childRef = self.registry.addDataset(childDatasetType, dataId, run=self.run, producer=producer)
            self.registry.attachComponent(parent=ref, child=childRef)
        self.datastore.put(obj, ref)  # also associates readers with each component
    return ref

def Butler.link(self, datasetType, childRefs, dataId, producer=None):
    """Create a deferred virtual composite dataset by associating existing datasets.

    There are two link overloads; this one is more powerful but less convenient in the common case.
    """
    ref = self.registry.addDataset(datasetType, dataId, run=self.run, producer=producer)
    for childRef in childRefs:
        self.registry.attachComponent(ref, childRef)
    self.registry.setAssembler(ref, self.config.getAssembler(datasetType))
    return ref

def Butler.link(self, datasetType, childDatasetTypes, dataId, producer=None):
    """Create a deferred virtual composite dataset by associating existing datasets.

    There are two link overloads; this one is less powerful but more convenient in the common case.
    """
    # Look up the DatasetRefs using the given DataID and then call the other overload.
    return self.link(datasetType,
                     [self.registry.find(childDatasetType, dataId) for childDatasetType in childDatasetTypes],
                     dataId, producer=producer)


def Registry.addDataset(self, datasetType, dataId, run, producer=None):
    dataset_id, registry_id = self.execute("INSERT INTO Dataset ...")
    return DatasetRef(datasetType, dataId, dataset_id, registry_id, ...)

def Registry.attachComponent(self, parent, child):
    self.execute("INSERT INTO DatasetComposition (parent_dataset_id, parent_registry_id, component_name, child_dataset_id, child_registry_id) ...")

def Registry.setAssembler(self, ref, assembler):
    self.execute("UPDATE Dataset SET assembler=? WHERE dataset_id=? AND registry_id=?", assembler.name, ref.dataset_id, ref.registry_id)


def Datastore.put(self, obj, ref):
    # ... actually write a file or something ...
    self.registry.execute("INSERT INTO Storage (dataset_id, registry_id, datastore_name, md4, size) ...")
    self.addReader(ref, self.config.getReadFormatter(ref.datasetType))
    for child in children:
        self.addReader(...)

def Datastore.addReader(self, ref, formatter):
    # ... record somewhere that we should use the given formatter when asked to read back ref ...
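
To make the put/link flow above concrete, here is a small in-memory sketch of a deferred virtual composite, mirroring the K, L, M, N example from the permitted combinations. Everything in it is an assumption made for illustration: `ToyRegistry`, `ToyButler`, the dataset type names, and the `get` method (a stand-in for an assembler that simply returns a dict of components) are hypothetical and much simpler than the interfaces sketched above.

```python
import itertools


class ToyRegistry:
    """In-memory stand-in for the Dataset and DatasetComposition tables."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.datasets = {}     # dataset_id -> (dataset type name, dataId)
        self.composition = []  # (parent_id, component name, child_id)

    def addDataset(self, datasetTypeName, dataId):
        datasetId = next(self._ids)
        self.datasets[datasetId] = (datasetTypeName, dataId)
        return datasetId

    def attachComponent(self, parent, componentName, child):
        self.composition.append((parent, componentName, child))


class ToyButler:
    def __init__(self):
        self.registry = ToyRegistry()
        self.storage = {}  # dataset_id -> object; stands in for a Datastore

    def put(self, obj, datasetTypeName, dataId):
        """Write a concrete dataset and return its id."""
        ref = self.registry.addDataset(datasetTypeName, dataId)
        self.storage[ref] = obj
        return ref

    def link(self, datasetTypeName, children, dataId):
        """Create a deferred virtual composite from {component name: child id}."""
        ref = self.registry.addDataset(datasetTypeName, dataId)
        for componentName, childRef in children.items():
            self.registry.attachComponent(ref, componentName, childRef)
        return ref

    def get(self, ref):
        """Return the stored object, or assemble a dict of components."""
        if ref in self.storage:
            return self.storage[ref]
        return {name: self.get(child)
                for parent, name, child in self.registry.composition
                if parent == ref}


butler = ToyButler()
dataId = {"Visit": 1, "Sensor": 2}
k = butler.put("image pixels", "ToyImage", dataId)
l = butler.put("variance pixels", "ToyVariance", dataId)
m = butler.link("ToyMaskedImage", {"image": k, "variance": l}, dataId)
n = butler.link("ToyExposure", {"maskedImage": m, "image": k, "variance": l}, dataId)
```

Note that all four datasets get their own Registry entry, while only the two that were put have anything in storage; the composites exist purely as associations.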

...