Public access to data post-proprietary-period

From the meeting agenda

We should develop and advertise a clearer plan for how non-Data Rights holders can access data release(s) that are no longer proprietary.
- Bulk access through a cloud host?
- Unauthenticated API or Portal access?
- Something else?
- More if they pay?
Have to make sure this is consistent with Ops project thinking.

Some considerations from Gregory Dubois-Felsmann:

The original LSST story was rather provocative from the point of view of a non-data-rights-holder:

We will serve the most recent annual data release and the one before it.
- I.e., any given data release (after the startup transient) is served for almost exactly two years.
The data are no longer proprietary following 24 months from their release.
- I.e., they become non-proprietary at precisely the moment that they are removed from service.
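The coincidence described above can be checked with a little date arithmetic. The following is a minimal sketch, assuming a strictly annual release cadence; the release date used is purely hypothetical, and the function and constant names are illustrative only.

```python
from datetime import date

SERVED_RELEASES = 2      # the most recent release plus the one before it
PROPRIETARY_MONTHS = 24  # proprietary period, counted from the release date


def add_months(d, months):
    """Shift a date by whole months, keeping the day of month.

    Only safe for days that exist in every month (e.g. day 1).
    """
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, d.day)


def release_timeline(release_date, cadence_months=12):
    """Return (removed_from_service, becomes_public) for one data release."""
    # A release drops out of service when SERVED_RELEASES newer releases exist.
    removed = add_months(release_date, SERVED_RELEASES * cadence_months)
    public = add_months(release_date, PROPRIETARY_MONTHS)
    return removed, public


# Hypothetical release date, for illustration only.
removed, public = release_timeline(date(2030, 1, 1))
print("removed from service:", removed)
print("becomes non-proprietary:", public)
```

With an annual cadence the two dates are identical, which is the provocative feature of the original story: the data become public at the moment they stop being served.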

Subsequently, after input from Beth Willman and others, the story about how many data releases are kept spinning has been softened.

DMS-REQ-0364 and surrounding LSE-61 requirements were devised to ensure that the construction project delivers a system that gives the operations organization great flexibility in deciding how many older releases to serve, up to "all of them", depending on available resources.
- These requirements include flexibility in determining whether to serve only subsets of releases older than two years (e.g., just catalogs, just coadded images, etc.)

However, at the same time, in the context of the previous round of negotiations on deals exchanging operations funding support for data:

We made clear that "data rights" are not the same as "data access".
- I.e., that even if we preserve access through the project-provided DACs to releases older than two years, that access may still only be available to a limited community.

Planning for providing data access, of any kind, to non-data-rights / non-data-access-rights holders, should include the following considerations:

Will access still need to be authenticated (but open to anyone who is registered)? This goes to issues of abuse-detection and the ability to assess who is using resources.
Can we confirm that every DPDD data product that is available to data rights holders will, ultimately, be available in some form to all comers after the proprietary period is over?
Are we prepared for what is likely to be a pulse of very substantial demand for the data immediately following their transition to non-proprietary?
Do we aspire to directly support a larger user community, or only to support non-Project data centers that wish to serve the data to their users?
In the context of the LSP, could non-proprietary data access be limited just to the API Aspect (i.e., to our IVOA and related network APIs)? API + Portal Aspects? (I am assuming that it is not financially realistic to offer Notebook Aspect, next-to-data processing, and batch services to a larger community.)
Or are we imagining providing access in a way that's fundamentally different from the LSP, e.g., just as a big bag of files with a handbook explaining their organization and content?
What access do we provide to non-rights-holders to Prompt data that are more than two years old? Or are we going to claim that we are only releasing the AP data products that are re-created in the DRs?
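To make the "API Aspect only" option above concrete: from a user's point of view, access through an IVOA TAP service amounts to sending an ADQL query to the service's synchronous endpoint. The sketch below only constructs such a request URL and does not contact any server. The base URL, table name, and column names are hypothetical placeholders, not actual Rubin/LSST endpoints or schema.

```python
from urllib.parse import urlencode


def tap_sync_url(base_url, adql):
    """Build a synchronous TAP query URL for the given ADQL string."""
    params = {
        "REQUEST": "doQuery",  # standard TAP request type
        "LANG": "ADQL",        # query language
        "QUERY": adql,
    }
    return f"{base_url.rstrip('/')}/sync?{urlencode(params)}"


# Hypothetical service URL and schema, for illustration only.
url = tap_sync_url(
    "https://tap.example.org/tap",
    "SELECT TOP 10 objectId, ra, dec FROM dr.Object",
)
print(url)
```

Whether such a request must carry credentials is exactly the authentication question raised in the first consideration above.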

Note that in no case are we required to do ANYTHING until DR3.

Options available in the baseline:

Use the data-center-to-data-center Bulk Download facility, which is already in the baseline, to export data to any data center that wants it.
- Require an agreement on compensation for marginal costs for data centers that really want all of, or a large fraction of, the data volume.
- Perhaps define a "free subset" (e.g., just the Object table) that requires no compensation (as long as the requestor is still a data center and commits to some level of republication).
- Publish documentation explaining how a non-Project data center can combine bulk-downloaded data with the (open-source) LSP software to provide an LSP-like DAC on top of the data.

Options beyond the baseline:

Make a restricted subset of LSP functionality (e.g., API Aspect and Portal Aspect, but not Notebook Aspect) freely available to (self-registered?) non-rights-holders for at least the newest non-proprietary DR.
- Authentication still required.
- Would this access include the ability to invoke Web-accessible on-demand services like non-archived-data-product recreation? Forced-photometry-on-demand?
- Would this include all the DPDD data products, or only a subset - with the remainder accessible only via the bulk download route above?
- This could be LDF-hosted or cloud-hosted; the main point here is the "free" - that is, "free at the point of use".
Make a similar service available in the cloud, but with users responsible for whatever marginal costs they incur (egress charges, compute charges).
- Cloud providers might be willing to host the data "at rest" for free as long as they could charge users marginal costs.
- Our experimental work on TAP-over-Google-BigQuery is aimed in this direction - the data could be hosted in BigQuery for free "at rest" but an individual user would bring up their own personal TAP service (in a very easy-to-set-up manner) which could then be subject to charges for BigQuery query access and cloud egress. Similarly a user could run their own instance of the Portal Aspect in the cloud in addition to the API Aspect.
- This model would even allow non-rights-holders to have the full LSP experience, including Nublado and next-to-data query services, if they pay for it.
- Would users accept that the combination of this for-a-fee (but not collected by Rubin/LSST) service plus the bulk-download-to-data-center service and the (easy to anticipate) existence of many clones of the survey data in other data centers (e.g., at CDS) counts as genuine non-proprietary access?
Make a "documented bag of files" available in the cloud, accessible to anyone.
- Parquet for catalogs and metadata catalogs, FITS (or a successor) for images.
- Rubin/LSST pays the "at rest" charges and perhaps negotiates to cover the access charges for a small subset of the total dataset (e.g., just the Object table); users pay egress charges beyond that subset.
- No LSP-like services are provided at all.
- As above, documentation is provided for how to set up LSP-like services at your home institution based on the downloaded data.

Suggestions:

We should survey how existing large-scale surveys provide bulk data access at major data centers (WISE, Gaia, Pan-STARRS, etc.).
