(Draft of an RFC.)

Proposal

Adopt a canonical serialization for DataId and DatasetRef objects.

Motivation and Use Cases

To facilitate reliable exchange of information about specific datasets "outside of code", primarily through the human world and in copy/paste operations.

To avoid requiring consumers of results from dataset metadata queries to rebuild DataIds by assembling pieces from multiple columns returned by the query.

  1. When performing an ObsTAP or SIAv2 query against LSST data, the availability of a string serialization of DatasetRef values for the datasets reported in the query result would enable a DatasetRef-valued column to be added to the result, marked up with data model metadata that enables a programmatic or UI client to recognize the column as such.  This in turn would enable the following applications:
    1. For the case of programmatic queries, LSST can provide a Python query API that directly returns DatasetRef or DataId objects, as appropriate, which can then be used directly in Python code to access and process the datasets in question.
    2. For the case of UI-driven (Portal Aspect) queries, the Portal can provide UI actions such as "copy reference to clipboard" that facilitate interoperation between the Portal and Notebook Aspects - the user can search for data in the Portal and then paste immediately usable references to that data directly into Python code in the Notebook.
    3. A programmatic or UI client can use IVOA DataLink metadata in a query result to discover additional resources related to a row returned from a dataset metadata query.  See (2) below for more detail.
    4. By being able to recognize a column as containing a DatasetRef serialization, a UI client could display it with space-efficient UI elements like a DATASETREF icon with a copy-to-clipboard function, instead of displaying a cryptic JSON string in a wide column.
  2. One or more IVOA DataLink "{links} services" can be created whose ID= query parameter takes a DataId or DatasetRef serialization and returns references to related data (a sketch of such a request follows this list).  This could be used for things like linking a calexp to the directly associated raw, difference image, etc., and it could also be used to look up indirect associations like going from a visit image to the calibration images that are configured to be the appropriate ones for it.
  3. In automatically generated, or even human-generated, reports, datasets that had processing problems or were otherwise worth noting could be identified by their canonical serialization, making it easy for readers to go to a Python prompt and perform further analysis on those datasets.
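
As a purely illustrative sketch of use case 2, a client holding a serialized DatasetRef might query a {links} service as follows.  The endpoint URL and the exact form of the ID value here are assumptions, not part of this proposal:

from urllib.parse import urlencode

# Hypothetical {links} service endpoint; a real URL would be published by LSST.
LINKS_SERVICE = "https://data.lsst.example/api/datalink/links"

# A canonical DatasetRef serialization, e.g. pasted from a query result.
dataset_ref = '{"@type": "DatasetRef", "...": "..."}'

# DataLink takes the identifier in its ID parameter; urlencode supplies the
# percent-encoding that the JSON text requires.
url = LINKS_SERVICE + "?" + urlencode({"ID": dataset_ref})
print(url)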

Serialization technology

It is proposed to use JSON-LD (standard | Wikipedia) for the canonical serialization.

This means that the data content of a DataId or a DatasetRef will be represented as JSON, but also that the JSON will include type information referring to a vocabulary published by LSST.

A typical JSON-LD object might look like this (hat tip to Brian Van Klaveren):

{
  "@context": {
    "@vocab": "http://lsst.org/butler-dm/v3/"
  },
  "@type": "DataId",
  [all other JSON here]
}

where http://lsst.org/butler-dm/v3 defines the root of a vocabulary of relevant terms, such as DatasetRef, DataId, Dimension, etc.
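
For concreteness, a complete DataId serialization might look like the following.  This is a purely illustrative sketch using familiar butler dimension names; the exact set of persisted attributes is an open question, as noted below:

import json

# Illustrative only: the dimension keys shown (instrument, visit, detector)
# are typical butler data ID keys, but the attributes actually persisted
# are an open question in this RFC.
data_id = {
    "@context": {"@vocab": "http://lsst.org/butler-dm/v3/"},
    "@type": "DataId",
    "instrument": "HSC",
    "visit": 903334,
    "detector": 22,
}
print(json.dumps(data_id, indent=2))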

In cases where many serialized values must be transmitted, e.g., as a column of DatasetRef values in a serialized table, we envision using table-level metadata to define the column type in such a way that consumers of the table can retrieve the JSON-LD type information, and limiting the JSON text of each row's value in the column to the actual data content.  A client of the serialized table (e.g., a Python API wrapping an HTTP query, or a UI client) can use the values directly, or re-wrap the values with the type information if it is appropriate to regenerate the full JSON-LD object.
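
A minimal sketch of that strip/re-wrap round trip, using hypothetical helper names and assuming the table-level type information is available as a dictionary:

import json

# Table-level metadata carrying the JSON-LD markup once for the whole column.
COLUMN_TYPE_INFO = {
    "@context": {"@vocab": "http://lsst.org/butler-dm/v3/"},
    "@type": "DatasetRef",
}

def to_cell(jsonld_text: str) -> str:
    """Strip the JSON-LD markup, leaving only the data content for a table cell."""
    obj = json.loads(jsonld_text)
    return json.dumps({k: v for k, v in obj.items() if not k.startswith("@")})

def from_cell(cell_text: str) -> dict:
    """Regenerate the full JSON-LD object from a cell value plus table metadata."""
    obj = dict(COLUMN_TYPE_INFO)
    obj.update(json.loads(cell_text))
    return obj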

For a VOTable serialization, we will work with the IVOA to devise a specific proposal for the content of the <FIELD> element of the VOTable header that would realize this idea.  We anticipate this may mean using either a special utype value or UCD to indicate JSON-LD-typed data.

The Felis specification language for the LSST data model is already based on JSON-LD and thus would naturally accommodate the inclusion of JSON-LD-typed columns in tables defined in that data model.  We expect this capability to be used to define DataId and/or DatasetRef column types in the dataset metadata tables exposed via ObsTAP and/or SIAv2 services.

Open questions

Representation of both types of DataId

Will there be distinct types for the "minimal" and "fully expanded" DataId types introduced in DM-17023?

Exact content of the serialization

What attributes of DataId and DatasetRef will be persisted?  (E.g., will any of the producer / consumer / run information be persisted?)

3 Comments

  1. While the goals, motivations, and use cases are laudable, I'm a bit concerned that the proposed JSON-based serialization is too verbose and complex for something intended for "the human world".  It seems to me that (nearly?) all usages will have an implicit type from the usage context or can have an explicit type in metadata separate from the actual serialized information.  It also seems to me that mechanisms for specifying DataIds to PipelineTasks in command lines (e.g. SQL-WHERE-like strings) will need to be relatively compact (e.g. not needing to quote identifiers) and human-friendly; having two ways of specifying the same DataId could also be confusing, so I'd suggest unifying these as much as possible.

  2. I think it makes sense for this to be implemented outside daf_butler (in a package that depends on it), in order to minimize daf_butler's dependencies.

    Whether it makes sense to support serialization for different types of Data IDs depends on where they come from in various use cases:

    • Data IDs obtained from the Butler or in contexts in which a Registry client is available will generally be fully expanded (Python type ExpandedDataCoordinate).
    • Data IDs provided by users as inputs to queries are probably just informal dictionaries; upgrading them to a validated but minimal DataCoordinate also de facto requires a Registry instance (it actually just needs the data repository configuration and some Python code, not an actual database, but I think that's a distinction without a difference).

    Note that the expression syntax used to constrain processing on the command line is a query that yields data IDs, not a way of actually passing data IDs.  We may have a need to pass data IDs more directly on the command line in the future, but don't have one yet and haven't spec'd anything out.  So (in response to Kian-Tat Lim's comment) I think I'd lean towards targeting the ideas on this page towards web-API-friendly serialization rather than thinking of it as a human-friendly format.

  3. This was not meant to be a human-friendly format in the sense of "human-writeable".  It's meant to be a robust format for moving DataId and DatasetRef values around outside Python, but I did not imagine people hand-typing them.  It was probably a mistake to write "through the human world" above, rather than just "outside Python".  (BaBar had such a thing for the equivalent concept, and it even included a check digit to actively discourage people from hand-writing them.  I'm not specifically requesting that in the RFC, but I'd be open to it.)

    I am not very enthusiastic, though, about actively obscuring the content, e.g., by defining the serialization to be a base64 representation of a nominal binary layout.  The ability for humans to be able to read these sorts of strings seems useful and avoids our having to build more tools, e.g., to support unpacking.  If for some reason we need to make the strings more compact, we can do that as an optimization later on.

    Regarding "(nearly?) all usages will have an implicit type from the usage context or can have an explicit type in metadata separate from the actual serialized information.":

    1. This was addressed in the proposal text - in a VOTable with a column of such a type, I'd like to strip the JSON-LD type content from the per-row representations, and hold it in the table metadata.
    2. When a single value is copied to the clipboard, I think it is important to have the type metadata be present.  This enables a variety of future capabilities, including context-sensitive drag-and-drop, for instance.

    I basically agree with Jim Bosch about using a separate package.  In this case, though, what is the most natural & Pythonic way to implement something that works like the following?

    d = DataId('<paste clipboard contents here>')
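
    One natural sketch is to accept the serialized string directly in the constructor.  The parsing behavior and error handling below are assumptions, not a spec:

    import json

    class DataId:
        def __init__(self, serialized: str):
            obj = json.loads(serialized)
            # Reject values of the wrong JSON-LD type, e.g. a pasted DatasetRef.
            if obj.get("@type") != "DataId":
                raise TypeError(f"expected a DataId, got {obj.get('@type')!r}")
            # Keep only the data content; drop the JSON-LD markup keys.
            self._values = {k: v for k, v in obj.items() if not k.startswith("@")}

    With that, the line above works as written for any pasted JSON-LD DataId string.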