This page has been completely rewritten from its original form, and the terminology has changed. In particular, because we've settled on an approach in which every dataset always has an entry in the Registry's Dataset table, we have repurposed "virtual" to mean something different from its earlier usage.

Terminology

composite: a Dataset or DatasetType whose StorageClass defines a set of discrete named child datasets, called components

parent: synonym for composite

component: a Dataset or DatasetType that may be accessed as a child of a composite (in some cases may also be accessed in other ways)

child: synonym for component

virtual: a Dataset or DatasetType that is defined by its relationship to one or more other Datasets/DatasetTypes.

concrete: not virtual

immediate: all content is written in a single call to Butler.put

deferred: content is written via multiple calls to Butler.put and associated via a later call to Butler.link.

This leads to four fundamental kinds of Dataset[Type]s:

  • concrete (always immediate, may be a component or a composite)
  • virtual component (always immediate)
  • immediate virtual composite
  • deferred virtual composite

By design, the distinction between virtual and concrete is meaningful for both get and put, but the distinction between immediate and deferred is meaningful only for put.
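The four kinds listed above can be sketched as a classification over three flags. This is only an illustration of the terminology: `Kind`, `classify`, `visible_on_get`, and their parameters are hypothetical names invented here, not part of the Butler API.

```python
from enum import Enum


class Kind(Enum):
    """The four fundamental kinds of Dataset[Type] (hypothetical enum)."""
    CONCRETE = "concrete"
    VIRTUAL_COMPONENT = "virtual component"
    IMMEDIATE_VIRTUAL_COMPOSITE = "immediate virtual composite"
    DEFERRED_VIRTUAL_COMPOSITE = "deferred virtual composite"


def classify(virtual: bool, composite: bool, deferred: bool = False) -> Kind:
    """Map terminology flags onto one of the four kinds.

    Assumption of this sketch: a virtual dataset that is not a composite
    is treated as a virtual component.
    """
    # Concrete datasets and virtual components are always immediate.
    if deferred and not (virtual and composite):
        raise ValueError("only a virtual composite can be deferred")
    if not virtual:
        return Kind.CONCRETE
    if not composite:
        return Kind.VIRTUAL_COMPONENT
    if deferred:
        return Kind.DEFERRED_VIRTUAL_COMPOSITE
    return Kind.IMMEDIATE_VIRTUAL_COMPOSITE


def visible_on_get(kind: Kind) -> str:
    """On get, only the virtual/concrete distinction is meaningful."""
    return "concrete" if kind is Kind.CONCRETE else "virtual"
```

In this sketch the immediate/deferred flag changes the kind but not what `visible_on_get` reports, which mirrors the statement that the distinction matters only for put.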

Principles

  • All Datasets have an entry in the Registry's Dataset table.  This implies that a composite Dataset is more than the sum of its components: it also includes (Registry) information to associate them.
  • Any provenance graph (both what is recorded after processing and the QuantumGraph produced by Preflight) must contain nodes for both composites and components, because:
    • a SuperTask may consume only some components of a composite, so all component nodes must be in the graph;
    • a deferred virtual composite must be created explicitly, so it cannot be considered implicit in the graph;
    • whether a particular composite DatasetType is defined as concrete, virtual, immediate, or deferred is considered hidden from SuperTasks, so we cannot include just some composite nodes in the graph.
  • All information needed to read a Dataset is saved at the level of the Dataset (in some combination of Registry and Datastore).  No information necessary for reading a Dataset is stored at a Datastore-wide or Registry-wide level, and it should never be necessary to configure a Butler in a particular way in order to read something.
  • It must be possible to change whether a particular DatasetType is written as concrete, virtual, immediate, or deferred by changing only the Butler/Datastore configuration provided when initializing a client.
    • As a result, any composite StorageClass must be writeable as concrete (with virtual components), immediate virtual, or deferred virtual; the StorageClass itself shall not be specialized to one of these choices.
    • No Registry content should be changed when controlling how a composite DatasetType is written.
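One possible reading of the principles above is shown in the toy model below: the write mode comes only from client-side configuration, and the Registry ends up with the same entries either way, while only the stored artifacts differ. Everything here is hypothetical (`ToyRegistry`, `put_composite`, the mode strings, the `exposure` type, and the `parent.component` naming); it is not the Butler or Registry API. The deferred case, which would need several puts followed by a link, is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """Hypothetical stand-in for the Registry's Dataset table."""
    datasets: dict = field(default_factory=dict)  # dataset id -> type name
    links: dict = field(default_factory=dict)     # composite id -> {component: id}

    def add(self, type_name: str) -> int:
        dataset_id = len(self.datasets) + 1
        self.datasets[dataset_id] = type_name
        return dataset_id


def put_composite(registry, write_config, type_name, components):
    """Register a composite plus its components; return (id, artifacts).

    `write_config` is client-side configuration only: it selects how the
    content is stored, not which Registry entries are created.
    """
    mode = write_config.get(type_name, "concrete")
    if mode not in ("concrete", "immediate_virtual"):
        raise ValueError(f"unsupported write mode in this sketch: {mode}")

    # Every dataset, composite or component, gets its own Registry entry,
    # and the composite also records how its components are associated.
    composite_id = registry.add(type_name)
    registry.links[composite_id] = {
        name: registry.add(f"{type_name}.{name}") for name in components
    }

    if mode == "concrete":
        artifacts = [type_name]  # one stored artifact holds everything
    else:
        artifacts = [f"{type_name}.{name}" for name in components]
    return composite_id, artifacts
```

Running `put_composite` against two fresh registries with different modes for the same type leaves the registries equal; only the returned artifact lists differ.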

Configuration

todo

Writing Datasets

todo

Reading Datasets

todo

