(Draft of an RFC.)
Adopt a canonical serialization for DataId and DatasetRef objects.
To facilitate reliable exchange of information about specific datasets "outside of code", primarily between people and in copy/paste operations.
To avoid requiring consumers of dataset metadata query results to rebuild DataIds by assembling pieces from multiple columns returned by the query.
Serialized DatasetRef values for the datasets reported in a query result would enable a DatasetRef-valued column to be added to the result, marked up with data model metadata that lets a programmatic or UI client recognize the column as such. This in turn would enable the following applications:
A service whose ID= query parameter takes a DataId or DatasetRef serialization and returns references to related data. This could be used to link a calexp to the directly associated raw, difference image, etc. It could also be used to look up indirect associations, such as going from a visit image to the calibration images configured as the appropriate ones for it.
It is proposed to use JSON-LD for the canonical serialization.
This means that the data content of a DataId or a DatasetRef will be represented as JSON, and that the JSON will additionally include type information that refers to a vocabulary published by LSST.
A typical JSON-LD object might look like this (hat tip to Brian Van Klaveren):
    {
      "@context": {
        "@vocab": "http://lsst.org/butler-dm/v3/"
      },
      "@type": "DataId",
      [all other JSON here]
    }
where http://lsst.org/butler-dm/v3 defines the root of a vocabulary of relevant terms, such as DatasetRef, DataId, Dimension, etc.
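As a rough illustration of the envelope described above, the following Python sketch wraps plain data content in a JSON-LD object. The helper name `to_json_ld` and the dimension names and values (`instrument`, `visit`, `detector`) are assumptions made for this example only; the RFC does not yet fix which attributes are serialized, and sorting keys is just one way to get repeatable output, not a defined canonical form.

```python
import json

# Vocabulary root as written in the JSON-LD example in this RFC.
BUTLER_VOCAB = "http://lsst.org/butler-dm/v3/"


def to_json_ld(type_name, content):
    """Wrap plain data content in a JSON-LD envelope (hypothetical helper)."""
    doc = {
        **content,
        # The envelope keys are written last so they cannot be overridden.
        "@context": {"@vocab": BUTLER_VOCAB},
        "@type": type_name,
    }
    # sort_keys gives repeatable text output; this is an assumption of the
    # sketch, not a canonicalization rule stated by the RFC.
    return json.dumps(doc, sort_keys=True)


# Illustrative dimension names and values only.
serialized = to_json_ld(
    "DataId", {"instrument": "HSC", "visit": 903334, "detector": 16}
)
print(serialized)
```

A consumer can parse the string with `json.loads` and inspect `@type` to decide how to interpret the remaining keys.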
In cases where many serialized values must be transmitted, e.g., as a column of DatasetRef values in a serialized table, we envision using table-level metadata to define the column type in such a way that consumers of the table can retrieve the JSON-LD type information, and limiting the JSON text of each row's value in the column to the actual data content. A client of the serialized table (e.g., a Python API wrapping an HTTP query, or a UI client) can use the values directly, or re-wrap the values with the type information if it is appropriate to regenerate the full JSON-LD object.
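The re-wrapping step could look roughly like the sketch below, in which the JSON-LD type information is declared once in column metadata and each row carries only the data content. The metadata layout (`column_meta`), the function `rewrap`, and the row fields (`datasetType`, `dataId`) are all hypothetical; how the type information is actually attached to a column is exactly what remains to be specified.

```python
import json

# Hypothetical table-level metadata: the JSON-LD type is declared once
# for the whole column rather than repeated in every row.
column_meta = {
    "name": "dataset_ref",
    "json_ld": {
        "@context": {"@vocab": "http://lsst.org/butler-dm/v3/"},
        "@type": "DatasetRef",
    },
}

# Each row value is JSON text holding only the data content.
# Field names and values are illustrative, not a defined schema.
rows = [
    '{"datasetType": "calexp", "dataId": {"visit": 903334, "detector": 16}}',
    '{"datasetType": "calexp", "dataId": {"visit": 903334, "detector": 17}}',
]


def rewrap(row_text, meta):
    """Rebuild the full JSON-LD object for a single row value."""
    content = json.loads(row_text)
    return {**content, **meta["json_ld"]}


full_objects = [rewrap(row, column_meta) for row in rows]
print(json.dumps(full_objects[0], indent=2))
```

Clients that only need the data content can skip `rewrap` and call `json.loads` on the row text directly.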
For a VOTable serialization, we will work with the IVOA to devise a specific proposal for the content of the <FIELD> element of the VOTable header that would realize this idea. We anticipate this may mean using either a special utype value or a UCD to indicate JSON-LD-typed data.
The Felis specification language for the LSST data model is already based on JSON-LD and would therefore naturally accommodate JSON-LD-typed columns in tables defined in that data model. We expect this capability to be used to define DataId and/or DatasetRef column types in the dataset metadata tables exposed via ObsTAP and/or SIAv2 services.
Will there be distinct types for the "minimal" and "fully expanded" DataId types introduced in ?
What attributes of DataId and DatasetRef will be persisted? (E.g., will any of the producer / consumer / run information be persisted?)