API Design and AAIM

From yesterday:

Authorization is done through group membership; there will be a naming convention (to be cleared with the NCSA LDAP team), and the CILogon API will return both group names and gids

Groups will include functional rights (ability to use particular services) as well as data rights (both "public" and "private")

How can you determine whether an operation is authorized?

  • Do you have to try it and fail, or can you ask some service/library in advance, or does every dataset/table/whatever have an associated list of authorized groups?

Preference is for a microservice that would return a yes/no answer given a token and a group name or gid (taken from a service configuration); for now, this should be a trivial "extract group list and string/id match" operation (see the sketch after this list)

Could also be done using LDAP, but prefer not to for now; LDAP may be more current and can deal with long lists of groups, which could grow to hundreds per user; later, we could use LDAP as the back-end for the microservice

  • NCSA should write this microservice; James Basney accepts this responsibility and will propose a REST API by the end of January
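
A minimal sketch of what that trivial first version could look like, assuming a Flask service and a CILogon-style "isMemberOf" claim carrying the group list in the JWT; the endpoint name, claim layout, and the skipped signature verification are all illustrative, since the real REST API is still to be proposed:

```python
# Hypothetical authorization microservice: given a bearer token and a group
# name, answer yes/no by string-matching against the token's group claim.
import jwt  # PyJWT
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/authorized")
def authorized():
    token = request.headers["Authorization"].removeprefix("Bearer ")
    # A real service must verify the signature against CILogon's keys;
    # skipped here to keep the sketch self-contained.
    claims = jwt.decode(token, options={"verify_signature": False})
    groups = claims.get("isMemberOf", [])  # assumed claim name
    return jsonify({"authorized": request.args["group"] in groups})
```

Later, the same endpoint could consult LDAP as its back-end without changing the API.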

When a VOSpace request is made, the bearer token goes into the header; the service can get the JSON web token and cache it

  • Brian Van Klaveren to specify how the token information is received by each Web API service; this should be a few sentences in the documentation as well as a recommendation for how to use existing client libraries (AstroQuery, PyVO, etc.) to support OAuth; if absolutely necessary, we could provide our own client library (see the sketch below)
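
As a sketch of the recommended client-library route, assuming PyVO's session hook and an illustrative endpoint URL and token:

```python
# Pass a bearer token to an existing VO client rather than writing our own:
# PyVO services accept a requests.Session, so the Authorization header set
# here rides along on every call. The URL and token are placeholders.
import requests
import pyvo

session = requests.Session()
session.headers["Authorization"] = "Bearer <token>"

tap = pyvo.dal.TAPService("https://lsst.example.org/api/tap", session=session)
results = tap.search("SELECT TOP 10 * FROM dr1.Object")
```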

Is it possible to create user tokens that are read-only?

  • This is an interesting and likely useful functionality but is not trivial to implement
  • Figuring out what "read-only" actually means is also an issue, probably more like "do no harm"

TOPCAT: third-party application developers have challenges working with our security model

  • Third-party applications run on hardware provided by data-rights users and access LSST data via the Web API Aspect
  • The application may need to launch a Web browser to perform the login; returning the token to the application afterwards may be a challenge
  • Redirect to an application URL registered with the OS/browser might be a possibility (see the sketch after this list)
    • This likely means that the application has to be signed
  • Another approach is to return a copy-pasteable token; this works for long-lived tokens but not for short-lived ones
    • Having a "renew this token" flow might ameliorate that
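
For native applications, one common pattern is a loopback redirect: launch the browser, then catch the token when the identity provider redirects back to a local port. A minimal sketch, with the login URL, port, and `token` query parameter all illustrative:

```python
# Loopback-redirect token capture for a native app: start a one-shot local
# HTTP server, send the user to the login page, and read the token off the
# redirect. URLs and parameter names are placeholders.
import webbrowser
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class Handler(BaseHTTPRequestHandler):
    token = None

    def do_GET(self):
        Handler.token = parse_qs(urlparse(self.path).query).get("token", [None])[0]
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Login complete; you may close this window.")

with HTTPServer(("127.0.0.1", 8912), Handler) as srv:
    webbrowser.open("https://lsst.example.org/login?redirect_uri=http://127.0.0.1:8912/")
    srv.handle_request()  # block until the browser redirects back once

print("token:", Handler.token)
```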

Central user profile

  • Query history, including the ability to hide entries, would be large and so would go into Aspect-specific storage
    • Is this a TAP query? Perhaps; it could be stored in a MyDB table
  • Default Data Release could be cross-Aspect
  • Storage of Alert Filtering configuration might go here or elsewhere
  • awithers will confirm that general key/value information can be put into the CILogon system, but this is no longer urgent given the discussion below
  • Need to confirm that storage and APIs (LDAP) are sufficient
    • May make it difficult to bring up in other environments or to build sandboxes
    • Binary data and user-settable data may not be ideal in LDAP
  • Should be instance-specific, not global
  • Preferences directory in the User File Workspace; not a single file, because we don't understand all the data being stored (see the sketch after this list)
    • Redis or etcd could also be used but only if the above is found to be inadequate
  • Examples of preferences:
    • Default heuristic for picking a container for the Notebook (e.g. always the most recent weekly)
    • Default Portal layout, image stretch preference, coordinate system, x/y plot, table configuration (which columns are displayed)
    • Localization like language, timezone (seems more like LDAP user profile); alternative usernames/identifiers (e.g. on GitHub or Slack)
    • How much query history is to be stored could be an API preference
    • Default Data Release
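
A sketch of what the preferences directory could look like, assuming per-Aspect JSON files under an illustrative directory name (none of this layout is decided):

```python
# One small file per Aspect under the User File Workspace, rather than a
# single monolithic preferences file. Directory name and keys are placeholders.
import json
from pathlib import Path

PREFS_DIR = Path.home() / ".lsp" / "preferences"

def save_pref(aspect: str, key: str, value) -> None:
    PREFS_DIR.mkdir(parents=True, exist_ok=True)
    path = PREFS_DIR / f"{aspect}.json"
    prefs = json.loads(path.read_text()) if path.exists() else {}
    prefs[key] = value
    path.write_text(json.dumps(prefs, indent=2))

save_pref("portal", "coordinate_system", "ICRS")
save_pref("notebook", "default_container", "latest_weekly")
```

Because the files live in the File Workspace, they are reachable from every Aspect via VOSpace/WebDAV without any new service.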

Design document needs to include a subsection on security policies/issues for each component

  • How this component relates to/implements top-level security policies
  • Any specific issues of interest to this component (e.g. we don't allow root access in user containers)

SODA service and Portal compute plugins need to execute on behalf of a user

  • Today imgserv runs as the webserv user
  • It will in the future need to run as root and then seteuid to the authenticated user whenever a butler.get() is performed; Firefly would do the same thing (using a forked setuid executable, possibly started at the beginning of the user session); a sketch follows this list
  • Need to have a requirement that instantiating a cached and to-be-reused Butler never retrieves information that would be inaccessible to the get() user
  • A "data release" Butler could be used for a data-rights-holding user accessing data release products
  • This type of code will require auditing
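
A minimal sketch of the privilege drop around a Butler read; the notes call for a forked setuid executable, so this in-process Python version is only illustrative (it checks access but elides returning the data to the parent, and the Butler API is simplified):

```python
# Fork, drop privileges to the authenticated user, then attempt the read;
# the child exits non-zero if the user lacks access to the dataset.
import os
import pwd

def get_as_user(butler, dataset_type, data_id, username):
    pw = pwd.getpwnam(username)
    pid = os.fork()
    if pid == 0:                 # child: drop group first, then user
        os.setgid(pw.pw_gid)
        os.setuid(pw.pw_uid)
        butler.get(dataset_type, data_id)  # raises if inaccessible
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status) == 0
```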

IVOA services should not write, with the exception of TAP and VOSpace

File Workspace is best as a subdirectory of the user's "home" directory; that home directory is only accessible via JupyterLab; everything in the File Workspace is accessible via VOSpace (and WebDAV)

Are the accounts that people get "NCSA" accounts?

  • They are in the NCSA uid and username space, but they have no authorization by default to do anything else at NCSA, so effectively no; should mention this in the design document but this is an operational choice of NCSA
  • Unclear whether current NCSA users can gain LSST rights or how LSST users gain NCSA rights, but either is feasible given this system

Deployment Issues

One entry per Management Domain:

  • PDAC
    • Initial Availability: 2017
    • Purpose: Development
    • Intent: Developing Portal by SUIT devs; developing Notebooks by SQuaRE devs; developing Web APIs and Qserv by DAX devs
    • Deployment: By SUIT for Portal, by SQuaRE for Notebooks, by DAX for Web APIs and Qserv; includes both development deployments and more-stable deployments that the other Aspects can rely on
    • Local Qserv Data: SDSS S82, WISE, HSC, Simulated data (2018-?), LSST DR
    • Batch Resources: NCSA Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: Yes; Unreleased DR and Intermediates: No; Live EFD & Obs Ops Data: No
  • Integration
    • Initial Availability: 2018
    • Purpose: Integration
    • Intent: Integration & testing at LDF
    • Deployment: By LDF
    • Local Qserv Data: (none listed)
    • Batch Resources: NCSA Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: Yes; Unreleased DR and Intermediates: ?; Live EFD & Obs Ops Data: No
  • Science Validation
    • Initial Availability: 2018-12
    • Purpose: Science Validation
    • Intent: Pipeline development & DR validation for Science Ops staff; analysis of spectrograph data
    • Deployment: By LDF
    • Local Qserv Data: In-preparation DR, precursor data such as HSC and simulated data
    • Batch Resources: NCSA Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: Yes; Unreleased DR and Intermediates: Yes; Live EFD & Obs Ops Data: No
  • Commissioning Cluster
    • Initial Availability: Scheduled 2019-01-01, but dependent on Base occupancy on 2018-03-01
    • Purpose: Commissioning Cluster
    • Intent: Rapid analysis for Commissioning Team and Observatory Ops staff
    • Deployment: By LDF
    • Local Qserv Data: None?
    • Batch Resources: Commissioning Cluster Batch, Base Batch, and NCSA Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: X; Unreleased DR and Intermediates: ?; Live EFD & Obs Ops Data: Yes
  • US DAC
    • Initial Availability: 2021? for testing, possibly earlier if used for Commissioning Data Releases
    • Purpose: US DAC
    • Intent: Science analysis for data rights users
    • Deployment: By LDF
    • Local Qserv Data: LSST DRs, imported catalogs
    • Batch Resources: NCSA Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: Yes; Unreleased DR and Intermediates: No; Live EFD & Obs Ops Data: No
  • Chilean DAC
    • Initial Availability: 2022
    • Purpose: Chilean DAC
    • Intent: Science analysis for data rights users
    • Deployment: By LDF
    • Local Qserv Data: LSST DRs, imported catalogs
    • Batch Resources: Base Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: X; Unreleased DR and Intermediates: No; Live EFD & Obs Ops Data: No

Verify that Qserv for Science Validation is in the FY19 budget plan; otherwise a monolithic database might be usable

Image File and Database Versions

Need to have raw pixel data or PVI pixels with the best WCS; need to have PVIs with metadata headers corrected post-DR; etc.

Also need to be able to retrieve exact file and metadata headers initially released

Images will be retrieved via DAX from within collections; each collection is a single type of data product within a single DR; metadata about the collections needs to be available from metaserv

Should there be a URL distinction between Data Releases (particularly for TAP queries)?

  • That makes for many TAP endpoints, whether the distinction is in the path or the domain name
  • Prefer to have a single TAP endpoint that determines the database by parsing the query; it can then dispatch to different database servers for the back-end (see the sketch below)
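
A sketch of that dispatch, assuming the Data Release appears as a schema prefix in the query (the backend hostnames and the regex are illustrative):

```python
# Route an ADQL query to the database server for the Data Release whose
# schema it references. Hostnames and parsing are placeholders; a real
# implementation would use a proper ADQL parser.
import re

BACKENDS = {"dr1": "qserv-dr1.example.org", "dr2": "qserv-dr2.example.org"}

def pick_backend(adql: str) -> str:
    m = re.search(r"\bFROM\s+(\w+)\.", adql, re.IGNORECASE)
    if not m or m.group(1).lower() not in BACKENDS:
        raise ValueError("query does not reference a known Data Release schema")
    return BACKENDS[m.group(1).lower()]

print(pick_backend("SELECT objectId FROM dr1.Object"))  # qserv-dr1.example.org
```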

Data Model

DPDD is logical/aspirational/approximate, not physical

Tabular data is persisted as table files and read back as whole files, which is guaranteed to round-trip

For database tables, there's a loading process; will it round-trip? (There is no requirement for it to do so)

  • Will we store all generated columns in the table files in the database?

Names in afw tables are not the same as in the DPDD; they may contain more information

  • cat package "baseline" schema has been updated relatively recently for the Alert Production tables
  • Slots could be used to make them the same, either by materializing them as columns or as aliases in a view
  • Alias/mapping definitions have to come from the science side and be implemented by mechanisms provided by the database and/or afw table

Metadata needed to explain columns: help text, units, UCDs, linkages between columns, possibly VO-DML

  • daf_ingest should take in additional information to load into "meta-tables" that describe the database tables
  • It will check that the afw tables that it is loading conform
  • UCDs and units should definitely be built in at the Science Pipelines level; linkages are highly desirable
  • VO-DML is an extremely complex specification; we might want to push something back to them

Pipelines need to audit columns in afw tables and see if there is any custom transformation that is needed (e.g. angles in radians get converted to decimal degrees)

Start to have catalog outputs from periodic HSC runs loaded into lsst-db

Have each of the LaTeX tables in the DPDD be generated from YAML (which could replace the cat schema); the same YAML can also generate validation code to test the loaded database and the afw table outputs (see the sketch below)
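
A sketch of that single-source-of-truth idea, with an illustrative YAML layout (the real schema format replacing cat has not been defined):

```python
# One YAML description per table drives both documentation and validation.
# The layout below is a placeholder, not a proposed format.
import yaml

schema_yaml = """
Object:
  columns:
    - {name: ra, unit: deg, ucd: pos.eq.ra}
    - {name: decl, unit: deg, ucd: pos.eq.dec}
"""

schema = yaml.safe_load(schema_yaml)

def validate(table_name: str, actual_columns) -> bool:
    """Check that every documented column is present in the loaded table."""
    expected = {c["name"] for c in schema[table_name]["columns"]}
    return expected <= set(actual_columns)

print(validate("Object", ["ra", "decl", "psFlux"]))  # True
```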

Metadata should be loaded along with the shared afw table schema, rather than with every afw table

Visit tables can be loaded from FITS file headers today; there are also Butler registries today

In the future, the set of existing visits and the registries will be generated by the Archiver in the DBB; Science Pipelines will output additional per-visit information, which could be in afw tables or in image header metadata

Using SQLite for ingest for CI runs would be nice to avoid dependence on a DB server

All DPDD data products need to have at least one defined and documented serialization

EFD

Transformed EFD will contain join tables as well as aggregated value tables

Is there a requirement for access to the live EFD by any LSP instance?

Is there a DAX service for the transformed EFD and if so what is it?

  • Yes, and it is TAP
  • Need to answer units/UCD/linkage/metadata questions for EFD as well; need to talk to Dave Mills about capturing and providing this

Does the DAX team have the necessary information for doing that?

How does someone correlate a PSF measurement with wind speed, wind direction, and dome louver position?

  • The transformer aggregates wind speeds and directions over the duration of the visit and materializes the result, possibly as an average and a max gust
  • The transformer should be modular and applicable in the Header Service, the Transformed EFD, and possibly for Raw EFD clients (see the sketch below)
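
A sketch of the per-visit aggregation, with illustrative column names and a deliberately naive direction average (wind direction is an angle and really needs a circular mean):

```python
# Aggregate raw EFD samples over the duration of a visit and materialize
# summary values. Column names and the sample layout are placeholders.
import pandas as pd

def aggregate_wind(samples: pd.DataFrame, visit_start, visit_end) -> dict:
    window = samples[(samples.time >= visit_start) & (samples.time <= visit_end)]
    return {
        "wind_speed_avg": window.speed.mean(),
        "wind_gust_max": window.speed.max(),
        # Naive arithmetic mean; a real transformer needs a circular mean.
        "wind_direction_avg": window.direction.mean(),
    }
```

The same function could run in the Header Service, in the Transformed EFD loader, or in a Raw EFD client, which is the modularity asked for above.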

If there is a change in the transformation, is it applied to all historical data?

  • For Data Releases, yes
  • For the Transformed EFD, it should be, but this should create a new version and will require catch-up compute resources

Is the Transformed EFD accessible from all the DACs and Comm Cluster?

  • Yes, it is replicated from the Base to NCSA

Butler

How is a Butler initialized to access the data products from a given Data Release or from "Level 1" Nightly Processing?

  • Could take one or more environment variables to (help) instantiate the Butler; this needs to be a Gen3 Butler requirement (see the sketch below)
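
A sketch of environment-variable-driven construction, assuming a Gen3-style Butler constructor; the variable names are illustrative:

```python
# Instantiate a Butler pointed at the repo/collection an LSP deployment
# advertises through its environment. Variable names are placeholders.
import os
from lsst.daf.butler import Butler  # Gen3 Butler

repo = os.environ["LSST_BUTLER_REPO"]  # e.g. the Data Release repo for this DAC
butler = Butler(repo, collections=os.environ.get("LSST_BUTLER_COLLECTION"))
```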

Questions about retrieval of near-real-time data:

  • How do we use the Butler to find all the visits since a given time or a given previous visit?
    • Registry should be able to support this query
    • If these queries are complicated, we might have a library to be able to create them
  • How do we use the Butler to get the next visit (whatever its id is)?
    • The problem is knowing the id
    • Even with the Gen3 Butler, clients would likely have to use the previous mechanism to poll for new visits
    • Need to try to replace that with a blocking Butler API call (a sketch of the polling pattern follows)
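
A sketch of the polling pattern a blocking call would replace; the registry query method used here is hypothetical:

```python
# Poll the Butler registry for a visit newer than the last one seen.
# `queryVisits` is a hypothetical registry method, not a real API.
import time

def wait_for_next_visit(butler, last_visit: int, interval: float = 5.0) -> int:
    while True:
        new = sorted(v for v in butler.registry.queryVisits()  # hypothetical
                     if v > last_visit)
        if new:
            return new[0]
        time.sleep(interval)
```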

VO Interfaces

Definitely will support these:

  • SIA v2 (which in effect includes SODA)
  • TAP (includes ADQL), will return at least VOTable (possibly with embedded FITS), JSON, FITS tables in FITS files; possibly SQLite; Portal (today) would like VOTable with embedded FITS and JSON
  • VOSpace
  • Asynchronous requests will follow the UWS model; synchronous requests can have lower latency because they can serve results from memory; synchronous queries cannot be promoted to asynchronous (they would have to be killed and restarted); see the UWS sketch after this list
  • VOTable data format, UCDs, etc. that are used by the above
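
A sketch of the UWS asynchronous pattern through an existing client, assuming PyVO's async TAP job interface and an illustrative endpoint:

```python
# Submit an async TAP query as a UWS job, wait for a terminal phase, and
# fetch the results. The service URL is a placeholder.
import pyvo

tap = pyvo.dal.TAPService("https://lsst.example.org/api/tap")
job = tap.submit_job("SELECT objectId FROM dr1.Object")  # creates a UWS job
job.run()
job.wait(phases=["COMPLETED", "ERROR", "ABORTED"])
results = job.fetch_result()
job.delete()  # clean up the job resource
```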

imgserv v1 is in the spirit of SODA but is not quite there

metaserv supports TAP plus the RegTAP standard for the data model that can be queried, but also supports additional interfaces

  • Therefore we are also doing VOResource

Likely to support these:

  • ObsCore (a data model standard for information about observations, roughly a subset of CAOM)
  • ObsTAP (relational representation of ObsCore)

UWS can be an API to a batch system, but we may not want to support that

Not planning to support SCS (Simple Cone Search): a very old specification built on other old specifications that can be replaced by TAP

Not IVOA:

  • CAOM

Schedule:

  • VOSpace in the fall of 2018; needs to be coordinated with Portal; also NCSA needs to understand what the underlying filesystem model will be
  • metaserv v1 in the spring of 2018

Performance numbers from the Portal will help the APIs decide what implementations are feasible and high-priority

In the long term we will need API rate limiting

ADQL subset for Qserv is expected to be ADQL 2.3 without subqueries or coordinate transforms; DAX can list what they will definitely support, might support, will definitely not support, and when things may be delivered

ADQL for items in the Consolidated Database may be more difficult; if spatial indexing is required, it needs to be requested from NCSA and potentially Oracle, and if it is not available, the cost of building it on top should be exposed

  • Kian-Tat Lim to ensure that Consolidated Database requirements include spatial indexing

If ADQL functionality is subsetted, the subset should be exposed through TAPRegExt; it's possible that a subquery-less service might have to be a different TAP endpoint