work in progress

Overview

We can consider three kinds of composite dataset: virtual, monolithic, and hybrid.  A Dataset that has an entry in the Dataset table is called concrete.

  • a virtual composite has concrete children;
  • a monolithic composite has a concrete parent;
  • a hybrid composite has both concrete children and a concrete parent.
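
As a minimal sketch of these definitions, the three kinds can be told apart by which Datasets are concrete. The function name and its boolean arguments are illustrative assumptions, not part of any Butler or Registry API.

```python
def classify_composite(parent_is_concrete: bool, children_are_concrete: bool) -> str:
    """Classify a composite by which of its Datasets have Dataset-table entries.

    Hypothetical helper for illustration only.
    """
    if parent_is_concrete and children_are_concrete:
        return "hybrid"
    if children_are_concrete:
        return "virtual"
    if parent_is_concrete:
        return "monolithic"
    # With nothing concrete there is nothing in the Dataset table to refer to.
    raise ValueError("a composite needs a concrete parent or concrete children")
```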

We probably don't need all of these, and might be able to get away with just hybrid (our original design).  The scenarios below consider different combinations of these.

Virtual Only

Ruled out: no way to support reading or writing a parent dataset as a single file.

Monolithic Only

Ruled out: no way to support the "define an Exposure DatasetType with calexp's image and jointcal's wcs" use case.

Hybrid Only

The original design.

Composites that are defined in terms of separately-written components still require an additional Butler call for the parent, to add the parent Dataset record as well as the join-table records that connect the parent to its children.
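
The extra call can be pictured as one Dataset insert plus one join-table insert per child. The sketch below uses an in-memory SQLite database. The `DatasetComposition` table, its columns, and the function names are assumptions made for illustration, and the dataset type names are placeholders, not the actual Registry schema.

```python
import sqlite3


def make_registry() -> sqlite3.Connection:
    """Create a toy registry with a Dataset table and a hypothetical join table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Dataset (
            dataset_id INTEGER PRIMARY KEY,
            dataset_type TEXT NOT NULL
        );
        CREATE TABLE DatasetComposition (
            parent_id INTEGER NOT NULL REFERENCES Dataset(dataset_id),
            component_id INTEGER NOT NULL REFERENCES Dataset(dataset_id),
            component_name TEXT NOT NULL
        );
        """
    )
    return conn


def add_dataset(conn: sqlite3.Connection, dataset_type: str) -> int:
    """Insert one concrete Dataset record and return its id."""
    cur = conn.execute("INSERT INTO Dataset (dataset_type) VALUES (?)", (dataset_type,))
    return cur.lastrowid


def add_hybrid_parent(conn: sqlite3.Connection, parent_type: str, components: dict) -> int:
    """Add the parent record and link it to already-written children.

    `components` maps a component name to the dataset_id of an existing child.
    """
    with conn:  # commit on success, roll back on error
        parent_id = add_dataset(conn, parent_type)
        conn.executemany(
            "INSERT INTO DatasetComposition (parent_id, component_id, component_name) "
            "VALUES (?, ?, ?)",
            [(parent_id, child_id, name) for name, child_id in components.items()],
        )
    return parent_id
```

In this picture, something has to call `add_hybrid_parent` after the children have been written, which is the role a trivial SuperTask would play.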

This implies that both parents and children must be included in the expanded QuantumGraph, which may in turn require trivial SuperTasks to add parents when this is not a clear responsibility of any of the SuperTasks that produce the children.

Virtual+Monolithic

This is quite different from Hybrid Only, but it may provide a better user experience (no need for trivial SuperTasks for parent definitions) and a simpler conceptual design.

Composites that are defined in terms of separately-written components are handled as virtuals, so they do not require an additional Butler call for the parent.  This requires that the DataUnits utilized by the virtual parent be the strict superset of the DataUnits utilized by its children, which is probably not a practical concern.
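
The DataUnit constraint is easy to check mechanically. The sketch below reports which DataUnits a virtual parent would be missing. The function name and the unit names in the usage example are illustrative assumptions. It only checks that the parent covers every unit used by any child, and does not enforce strictness.

```python
from typing import Dict, Iterable, Set


def missing_parent_units(
    parent_units: Iterable[str],
    children_units: Dict[str, Iterable[str]],
) -> Set[str]:
    """Return the DataUnits used by some child but absent from the parent.

    An empty result means the parent's units cover all of its children's units.
    """
    required: Set[str] = set()
    for units in children_units.values():
        required |= set(units)
    return required - set(parent_units)
```

For example, with hypothetical unit names, a parent keyed on `visit`, `sensor`, and `tract` covers an image keyed on `visit` and `sensor` together with a wcs keyed on all three, whereas a parent keyed only on `visit` and `sensor` would be reported as missing `tract`.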

Composites that are written via a single put may be handled in two ways:

  1. Always monolithic: whether the Dataset is written as one file or many is up to the Datastore and is completely hidden from the Registry.  When retrieving a component, Butler sees (via DatasetTypeComposition entries in the Registry) that the component is part of a monolithic dataset, and passes to Datastore.get the parent dataset identifiers along with the name of the component.  This may provide an opportunity to bundle calls that retrieve multiple components from the same parent.
  2. Monolithic for single-file storage, virtual for multi-file storage.  This moves the single-file/multi-file decision from Datastore configuration to the DatasetType definition in the Registry.  This may be advantageous for transfers, but it would require having multiple DatasetTypes just to support saving the same conceptual data product different ways.  We could in turn address/mitigate that by having some kind of namespacing/versioning system for DatasetType definitions, which we may need anyway for long-lived repositories.
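
The component lookup in option (1) might be sketched as follows. Everything here is a toy stand-in: the `COMPOSITION` mapping plays the role of DatasetTypeComposition entries, `ToyDatastore.get` takes a hypothetical `component` argument, and the `calexp.wcs` naming is an assumption rather than an established convention.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical stand-in for DatasetTypeComposition entries:
# component DatasetType name -> (parent DatasetType name, component name).
COMPOSITION: Dict[str, Tuple[str, str]] = {
    "calexp.image": ("calexp", "image"),
    "calexp.wcs": ("calexp", "wcs"),
}


class ToyDatastore:
    """Stores whole parents; how many files it would use is hidden from callers."""

    def __init__(self) -> None:
        self._parents: Dict[Tuple[str, tuple], Dict[str, Any]] = {}

    def put(self, dataset_type: str, data_id: tuple, components: Dict[str, Any]) -> None:
        self._parents[(dataset_type, data_id)] = dict(components)

    def get(self, dataset_type: str, data_id: tuple, component: Optional[str] = None) -> Any:
        parent = self._parents[(dataset_type, data_id)]
        # Return the whole parent, or just the requested component.
        return parent if component is None else parent[component]


def butler_get(datastore: ToyDatastore, dataset_type: str, data_id: tuple) -> Any:
    """Redirect component requests to the monolithic parent plus a component name."""
    if dataset_type in COMPOSITION:
        parent_type, component = COMPOSITION[dataset_type]
        return datastore.get(parent_type, data_id, component=component)
    return datastore.get(dataset_type, data_id)
```

Because every component request passes through the parent's identifiers, a real implementation could group requests by parent before calling the Datastore, which is where the bundling opportunity would come from.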

The expanded QuantumGraph in this scenario can always contain only child Datasets, because there is no possibility that all of the components of a composite exist without the parent existing as well.  Alternatively, the expanded QuantumGraph could contain only concrete Datasets (children for virtual composites, parents for monolithic composites), which may make it much easier to extract e.g. filenames from the graph in case (2).
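
The "only concrete Datasets" alternative amounts to a simple selection rule, sketched below. The input layout (parent name mapped to a kind and a list of child names) and the dataset names in the usage example are assumptions made for illustration and say nothing about how a QuantumGraph is actually represented.

```python
from typing import Dict, List, Tuple


def concrete_nodes(composites: Dict[str, Tuple[str, List[str]]]) -> List[str]:
    """Pick the Datasets that would appear in the expanded graph.

    Virtual composites contribute their children; monolithic composites
    contribute the parent.
    """
    nodes: List[str] = []
    for parent, (kind, children) in composites.items():
        if kind == "virtual":
            nodes.extend(children)
        elif kind == "monolithic":
            nodes.append(parent)
        else:
            raise ValueError(f"unexpected composite kind: {kind!r}")
    return nodes
```

With a virtual `exposure` built from `calexp_image` and `jointcal_wcs`, and a monolithic `calexp`, the rule selects the two children and the `calexp` parent, each of which corresponds to something the Datastore actually wrote.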

Virtual+Hybrid

Relative to Hybrid Only, this permits some composite Datasets to be no more than the sum of their parts, removing the need for trivial SuperTasks to define these parent Datasets.  If composites that only refer to already-written children are always handled as virtuals, then QuantumGraphs can always contain only children, but this adds the same constraint on parent/child DataUnits as the Virtual+Monolithic case.

Relative to Virtual+Monolithic, this option seems more complex without any obvious benefit, but it may be worth revisiting if Virtual+Monolithic proves problematic on closer inspection.

Monolithic+Hybrid

Not considered; no advantage over hybrid alone.

