API Design and AAIM

From yesterday:

Authorization is done through group membership; there will be a naming convention (to be cleared with the NCSA LDAP team), and the CILogon API will return both group names and gids

Groups will include functional rights (ability to use particular services) as well as data rights (both "public" and "private")

How can you determine whether an operation is authorized?

  • Do you have to try it and fail, or can you ask some service/library in advance, or does every dataset/table/whatever have an associated list of authorized groups?

Preference is for a microservice that would return a yes/no answer given a token and a group name or gid (taken from a service configuration); for now, this should be a trivial "extract group list and string/id match" operation (see the sketch after this list)

Could also be done using LDAP, but prefer not to for now; LDAP may be more current and can deal with long lists of groups, which could grow to hundreds per user; later, we could use LDAP as the back-end for the microservice

  • NCSA should write this microservice; James Basney accepts this responsibility and will propose a REST API by the end of January
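
A minimal sketch of what that trivial first version could look like, assuming a Flask service and a CILogon-style "isMemberOf" claim carrying the group list in the JWT; the endpoint name, claim layout, and the skipped signature verification are all illustrative, since the real REST API is still to be proposed:

```python
# Hypothetical authorization microservice: given a bearer token and a group
# name, answer yes/no by string-matching against the token's group claim.
import jwt  # PyJWT
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/authorized")
def authorized():
    token = request.headers["Authorization"].removeprefix("Bearer ")
    # A real service must verify the signature against CILogon's keys;
    # skipped here to keep the sketch self-contained.
    claims = jwt.decode(token, options={"verify_signature": False})
    groups = claims.get("isMemberOf", [])  # assumed claim name
    return jsonify({"authorized": request.args["group"] in groups})
```

Later, the same endpoint could consult LDAP as its back-end without changing the API.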

When a VOSpace request is made, the bearer token goes into the header; the service can get the JSON web token and cache it

  • Brian Van Klaveren to specify how the token information is received by each Web API service; this should be a few sentences in the documentation as well as a recommendation for how to use existing client libraries (AstroQuery, PyVO, etc.) to support OAuth; if absolutely necessary, we could provide our own client library (see the sketch below)
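
As a sketch of the recommended client-library route, assuming PyVO's session hook and an illustrative endpoint URL and token:

```python
# Pass a bearer token to an existing VO client rather than writing our own:
# PyVO services accept a requests.Session, so the Authorization header set
# here rides along on every call. The URL and token are placeholders.
import requests
import pyvo

session = requests.Session()
session.headers["Authorization"] = "Bearer <token>"

tap = pyvo.dal.TAPService("https://lsst.example.org/api/tap", session=session)
results = tap.search("SELECT TOP 10 * FROM dr1.Object")
```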

Is it possible to create user tokens that are read-only?

  • This is an interesting and likely useful functionality but is not trivial to implement
  • Figuring out what "read-only" actually means is also an issue, probably more like "do no harm"

TOPCAT: third-party application developers have challenges working with our security model

  • Third-party applications run on hardware provided by data-rights users and access LSST data via the Web API Aspect
  • The application may need to launch a Web browser to perform the login; returning the token to the application afterwards may be a challenge
  • Redirect to an application URL registered with the OS/browser might be a possibility (see the sketch after this list)
    • This likely means that the application has to be signed
  • Another approach is to return a copy-pasteable token; this works for long-lived tokens but not for short-lived ones
    • Having a "renew this token" flow might ameliorate that
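
For native applications, one common pattern is a loopback redirect: launch the browser, then catch the token when the identity provider redirects back to a local port. A minimal sketch, with the login URL, port, and `token` query parameter all illustrative:

```python
# Loopback-redirect token capture for a native app: start a one-shot local
# HTTP server, send the user to the login page, and read the token off the
# redirect. URLs and parameter names are placeholders.
import webbrowser
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class Handler(BaseHTTPRequestHandler):
    token = None

    def do_GET(self):
        Handler.token = parse_qs(urlparse(self.path).query).get("token", [None])[0]
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Login complete; you may close this window.")

with HTTPServer(("127.0.0.1", 8912), Handler) as srv:
    webbrowser.open("https://lsst.example.org/login?redirect_uri=http://127.0.0.1:8912/")
    srv.handle_request()  # block until the browser redirects back once

print("token:", Handler.token)
```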

Central user profile

  • Query history, including the ability to hide entries, would be large and so would go into Aspect-specific storage
    • Is this a TAP query? Perhaps; it could be stored in a MyDB table
  • Default Data Release could be cross-Aspect
  • Storage of Alert Filtering configuration might go here or elsewhere
  • awithers will confirm that general key/value information can be put into the CILogon system, but this is no longer urgent given the discussion below
  • Need to confirm that storage and APIs (LDAP) are sufficient
    • May make it difficult to bring up in other environments or to build sandboxes
    • Binary data and user-settable data may not be ideal in LDAP
  • Should be instance-specific, not global
  • Preferences directory in the User File Workspace; not a single file, because we don't understand all the data being stored (see the sketch after this list)
    • Redis or etcd could also be used but only if the above is found to be inadequate
  • Examples of preferences:
    • Default heuristic for picking a container for the Notebook (e.g. always the most recent weekly)
    • Default Portal layout, image stretch preference, coordinate system, x/y plot, table configuration (which columns are displayed)
    • Localization like language, timezone (seems more like LDAP user profile); alternative usernames/identifiers (e.g. on GitHub or Slack)
    • How much query history is to be stored could be an API preference
    • Default Data Release
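
A sketch of what the preferences directory could look like, assuming per-Aspect JSON files under an illustrative directory name (none of this layout is decided):

```python
# One small file per Aspect under the User File Workspace, rather than a
# single monolithic preferences file. Directory name and keys are placeholders.
import json
from pathlib import Path

PREFS_DIR = Path.home() / ".lsp" / "preferences"

def save_pref(aspect: str, key: str, value) -> None:
    PREFS_DIR.mkdir(parents=True, exist_ok=True)
    path = PREFS_DIR / f"{aspect}.json"
    prefs = json.loads(path.read_text()) if path.exists() else {}
    prefs[key] = value
    path.write_text(json.dumps(prefs, indent=2))

save_pref("portal", "coordinate_system", "ICRS")
save_pref("notebook", "default_container", "latest_weekly")
```

Because the files live in the File Workspace, they are reachable from every Aspect via VOSpace/WebDAV without any new service.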

Design document needs to include a subsection on security policies/issues for each component

  • How this component relates to/implements top-level security policies
  • Any specific issues of interest to this component (e.g. we don't allow root access in user containers)

SODA service and Portal compute plugins need to execute on behalf of a user

  • Today imgserv runs as the webserv user
  • It will in the future need to run as root and then seteuid to the authenticated user whenever a butler.get() is performed; Firefly would do the same thing (using a forked setuid executable, possibly started at the beginning of the user session); a sketch follows this list
  • Need to have a requirement that instantiating a cached and to-be-reused Butler never retrieves information that would be inaccessible to the get() user
  • A "data release" Butler could be used for a data-rights-holding user accessing data release products
  • This type of code will require auditing
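
A minimal sketch of the privilege drop around a Butler read; the notes call for a forked setuid executable, so this in-process Python version is only illustrative (it checks access but elides returning the data to the parent, and the Butler API is simplified):

```python
# Fork, drop privileges to the authenticated user, then attempt the read;
# the child exits non-zero if the user lacks access to the dataset.
import os
import pwd

def get_as_user(butler, dataset_type, data_id, username):
    pw = pwd.getpwnam(username)
    pid = os.fork()
    if pid == 0:                 # child: drop group first, then user
        os.setgid(pw.pw_gid)
        os.setuid(pw.pw_uid)
        butler.get(dataset_type, data_id)  # raises if inaccessible
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status) == 0
```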

IVOA services should not write, with the exception of TAP and VOSpace

File Workspace is best as a subdirectory of the user's "home" directory; that home directory is only accessible via JupyterLab; everything in the File Workspace is accessible via VOSpace (and WebDAV)

Are the accounts that people get "NCSA" accounts?

  • They are in the NCSA uid and username space, but they have no authorization by default to do anything else at NCSA, so effectively no; should mention this in the design document but this is an operational choice of NCSA
  • Unclear whether current NCSA users can gain LSST rights or how LSST users gain NCSA rights, but either is feasible given this system

Deployment Issues

One entry per Management Domain:

  • PDAC
    • Initial Availability: 2017
    • Purpose: Development
    • Intent: Developing Portal by SUIT devs; developing Notebooks by SQuaRE devs; developing Web APIs and Qserv by DAX devs
    • Deployment: By SUIT for Portal, by SQuaRE for Notebooks, by DAX for Web APIs and Qserv; includes both development deployments and more-stable deployments that the other Aspects can rely on
    • Local Qserv Data: SDSS S82, WISE, HSC, Simulated data (2018-?), LSST DR
    • Batch Resources: NCSA Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: Yes; Unreleased DR and Intermediates: No; Live EFD & Obs Ops Data: No
  • Integration
    • Initial Availability: 2018
    • Purpose: Integration
    • Intent: Integration & testing at LDF
    • Deployment: By LDF
    • Local Qserv Data: (none listed)
    • Batch Resources: NCSA Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: Yes; Unreleased DR and Intermediates: ?; Live EFD & Obs Ops Data: No
  • Science Validation
    • Initial Availability: 2018-12
    • Purpose: Science Validation
    • Intent: Pipeline development & DR validation for Science Ops staff; analysis of spectrograph data
    • Deployment: By LDF
    • Local Qserv Data: In-preparation DR, precursor data such as HSC and simulated data
    • Batch Resources: NCSA Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: Yes; Unreleased DR and Intermediates: Yes; Live EFD & Obs Ops Data: No
  • Commissioning Cluster
    • Initial Availability: Scheduled 2019-01-01, but dependent on Base occupancy on 2018-03-01
    • Purpose: Commissioning Cluster
    • Intent: Rapid analysis for Commissioning Team and Observatory Ops staff
    • Deployment: By LDF
    • Local Qserv Data: None?
    • Batch Resources: Commissioning Cluster Batch, Base Batch, and NCSA Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: X; Unreleased DR and Intermediates: ?; Live EFD & Obs Ops Data: Yes
  • US DAC
    • Initial Availability: 2021? for testing, possibly earlier if used for Commissioning Data Releases
    • Purpose: US DAC
    • Intent: Science analysis for data rights users
    • Deployment: By LDF
    • Local Qserv Data: LSST DRs, imported catalogs
    • Batch Resources: NCSA Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: Yes; Unreleased DR and Intermediates: No; Live EFD & Obs Ops Data: No
  • Chilean DAC
    • Initial Availability: 2022
    • Purpose: Chilean DAC
    • Intent: Science analysis for data rights users
    • Deployment: By LDF
    • Local Qserv Data: LSST DRs, imported catalogs
    • Batch Resources: Base Batch
    • Raw Data, DR Products, Metadata (DBB): Yes; User Workspace: X; Unreleased DR and Intermediates: No; Live EFD & Obs Ops Data: No

Verify that Qserv for Science Validation is in the FY19 budget plan; otherwise a monolithic database might be usable

Image File and Database Versions

Need to have raw pixel data or PVI pixels with the best WCS; need to have PVIs with metadata headers corrected post-DR; etc.

Also need to be able to retrieve exact file and metadata headers initially released

Images will be retrieved via DAX from within collections; each collection is a single type of data product within a single DR; metadata about the collections needs to be available from metaserv

Should there be a URL distinction between Data Releases (particularly for TAP queries)?

  • That makes for many TAP endpoints, whether the distinction is in the path or the domain name
  • Prefer to have a single TAP endpoint that determines the database by parsing the query; it can then dispatch to different database servers for the back-end (see the sketch below)
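
A sketch of that dispatch, assuming the Data Release appears as a schema prefix in the query (the backend hostnames and the regex are illustrative):

```python
# Route an ADQL query to the database server for the Data Release whose
# schema it references. Hostnames and parsing are placeholders; a real
# implementation would use a proper ADQL parser.
import re

BACKENDS = {"dr1": "qserv-dr1.example.org", "dr2": "qserv-dr2.example.org"}

def pick_backend(adql: str) -> str:
    m = re.search(r"\bFROM\s+(\w+)\.", adql, re.IGNORECASE)
    if not m or m.group(1).lower() not in BACKENDS:
        raise ValueError("query does not reference a known Data Release schema")
    return BACKENDS[m.group(1).lower()]

print(pick_backend("SELECT objectId FROM dr1.Object"))  # qserv-dr1.example.org
```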

Data Model

DPDD is logical/aspirational/approximate, not physical

Tabular data is persisted as table files and read back as whole files, which is guaranteed to round-trip

For database tables, there's a loading process; will it round-trip? (There is no requirement for it to do so)

  • Will we store all generated columns in the table files in the database?

Names in afw tables are not the same as in the DPDD; they may contain more information

  • cat package "baseline" schema has been updated relatively recently for the Alert Production tables
  • Slots could be used to make them the same, either by materializing them as columns or as aliases in a view
  • Alias/mapping definitions have to come from the science side and be implemented by mechanisms provided by the database and/or afw table

Metadata needed to explain columns: help text, units, UCDs, linkages between columns, possibly VO-DML

  • daf_ingest should take in additional information to load into "meta-tables" that describe the database tables
  • It will check that the afw tables that it is loading conform
  • UCDs and units should definitely be built in at the Science Pipelines level; linkages are highly desirable
  • VO-DML is an extremely complex specification; we might want to push something back to them

Pipelines need to audit columns in afw tables and see if there is any custom transformation that is needed (e.g. angles in radians get converted to decimal degrees)

Start to have catalog outputs from periodic HSC runs loaded into lsst-db

Have each of the LaTeX tables in the DPDD be generated from YAML (which could replace the cat schema); the same YAML can also generate validation code to test the loaded database and the afw table outputs (see the sketch below)
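
A sketch of that single-source-of-truth idea, with an illustrative YAML layout (the real schema format replacing cat has not been defined):

```python
# One YAML description per table drives both documentation and validation.
# The layout below is a placeholder, not a proposed format.
import yaml

schema_yaml = """
Object:
  columns:
    - {name: ra, unit: deg, ucd: pos.eq.ra}
    - {name: decl, unit: deg, ucd: pos.eq.dec}
"""

schema = yaml.safe_load(schema_yaml)

def validate(table_name: str, actual_columns) -> bool:
    """Check that every documented column is present in the loaded table."""
    expected = {c["name"] for c in schema[table_name]["columns"]}
    return expected <= set(actual_columns)

print(validate("Object", ["ra", "decl", "psFlux"]))  # True
```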

Metadata should be loaded along with the shared afw table schema, rather than with every afw table

Visit tables can be loaded from FITS file headers today; there are also Butler registries today

In the future, the set of existing visits and the registries will be generated by the Archiver in the DBB; Science Pipelines will output additional per-visit information, which could be in afw tables or in image header metadata

Using SQLite for ingest for CI runs would be nice to avoid dependence on a DB server

All DPDD data products need to have at least one defined and documented serialization

EFD

Transformed EFD will contain join tables as well as aggregated value tables

Is there a requirement for access to the live EFD by any LSP instance?

Is there a DAX service for the transformed EFD and if so what is it?

  • Yes, and it is TAP
  • Need to answer units/UCD/linkage/metadata questions for EFD as well; need to talk to Dave Mills about capturing and providing this

Does the DAX team have the necessary information for doing that?

How does someone correlate a PSF measurement with wind speed, wind direction, and dome louver position?

  • The transformer aggregates wind speeds and directions over the duration of the visit and materializes the result, possibly as an average and a max gust
  • The transformer should be modular and applicable in the Header Service, the Transformed EFD, and possibly for Raw EFD clients (see the sketch below)
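
A sketch of the per-visit aggregation, with illustrative column names and a deliberately naive direction average (wind direction is an angle and really needs a circular mean):

```python
# Aggregate raw EFD samples over the duration of a visit and materialize
# summary values. Column names and the sample layout are placeholders.
import pandas as pd

def aggregate_wind(samples: pd.DataFrame, visit_start, visit_end) -> dict:
    window = samples[(samples.time >= visit_start) & (samples.time <= visit_end)]
    return {
        "wind_speed_avg": window.speed.mean(),
        "wind_gust_max": window.speed.max(),
        # Naive arithmetic mean; a real transformer needs a circular mean.
        "wind_direction_avg": window.direction.mean(),
    }
```

The same function could run in the Header Service, in the Transformed EFD loader, or in a Raw EFD client, which is the modularity asked for above.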

If there is a change in the transformation, is it applied to all historical data?

  • For Data Releases, yes
  • For the Transformed EFD, it should be, but this should create a new version and will require catch-up compute resources

Is the Transformed EFD accessible from all the DACs and Comm Cluster?

  • Yes, it is replicated from the Base to NCSA

Butler

How is a Butler initialized to access the data products from a given Data Release or from "Level 1" Nightly Processing?

  • Could take one or more environment variables to (help) instantiate the Butler; this needs to be a Gen3 Butler requirement (see the sketch below)
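
A sketch of environment-variable-driven construction, assuming a Gen3-style Butler constructor; the variable names are illustrative:

```python
# Instantiate a Butler pointed at the repo/collection an LSP deployment
# advertises through its environment. Variable names are placeholders.
import os
from lsst.daf.butler import Butler  # Gen3 Butler

repo = os.environ["LSST_BUTLER_REPO"]  # e.g. the Data Release repo for this DAC
butler = Butler(repo, collections=os.environ.get("LSST_BUTLER_COLLECTION"))
```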

Questions about retrieval of near-real-time data:

  • How do we use the Butler to find all the visits since a given time or a given previous visit?
    • Registry should be able to support this query
    • If these queries are complicated, we might have a library to be able to create them
  • How do we use the Butler to get the next visit (whatever its id is)?
    • The problem is knowing the id
    • Even with the Gen3 Butler, clients would likely have to use the previous mechanism to poll for new visits
    • Need to try to replace that with a blocking Butler API call (a sketch of the polling pattern follows)
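
A sketch of the polling pattern a blocking call would replace; the registry query method used here is hypothetical:

```python
# Poll the Butler registry for a visit newer than the last one seen.
# `queryVisits` is a hypothetical registry method, not a real API.
import time

def wait_for_next_visit(butler, last_visit: int, interval: float = 5.0) -> int:
    while True:
        new = sorted(v for v in butler.registry.queryVisits()  # hypothetical
                     if v > last_visit)
        if new:
            return new[0]
        time.sleep(interval)
```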

VO Interfaces

Definitely will support these:

  • SIA v2 (which in effect includes SODA)
  • TAP (includes ADQL), will return at least VOTable (possibly with embedded FITS), JSON, FITS tables in FITS files; possibly SQLite; Portal (today) would like VOTable with embedded FITS and JSON
  • VOSpace
  • Asynchronous requests will follow the UWS model; synchronous requests can have lower latency because they can serve results from memory; synchronous queries cannot be promoted to asynchronous (they would have to be killed and restarted); see the UWS sketch after this list
  • VOTable data format, UCDs, etc. that are used by the above
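
A sketch of the UWS asynchronous pattern through an existing client, assuming PyVO's async TAP job interface and an illustrative endpoint:

```python
# Submit an async TAP query as a UWS job, wait for a terminal phase, and
# fetch the results. The service URL is a placeholder.
import pyvo

tap = pyvo.dal.TAPService("https://lsst.example.org/api/tap")
job = tap.submit_job("SELECT objectId FROM dr1.Object")  # creates a UWS job
job.run()
job.wait(phases=["COMPLETED", "ERROR", "ABORTED"])
results = job.fetch_result()
job.delete()  # clean up the job resource
```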

imgserv v1 is in the spirit of SODA but is not quite there

metaserv supports TAP plus the RegTAP standard for the data model that can be queried, but also supports additional interfaces

  • Therefore we are also doing VOResource

Likely to support these:

  • ObsCore (a data model standard for information about observations, roughly a subset of CAOM)
  • ObsTAP (relational representation of ObsCore)

UWS can be an API to a batch system, but we may not want to support that

Not planning to support SCS (Simple Cone Search): a very old specification built on other old specifications that can be replaced by TAP

Not IVOA:

  • CAOM

Schedule:

  • VOSpace in the fall of 2018; needs to be coordinated with Portal; also NCSA needs to understand what the underlying filesystem model will be
  • metaserv v1 in the spring of 2018

Performance numbers from the Portal will help the APIs decide what implementations are feasible and high-priority

In the long term we will need API rate limiting

ADQL subset for Qserv is expected to be ADQL 2.3 without subqueries or coordinate transforms; DAX can list what they will definitely support, might support, will definitely not support, and when things may be delivered

ADQL for items in the Consolidated Database may be more difficult; if spatial indexing is required, it needs to be requested from NCSA and potentially Oracle, and if it is not available, the cost of building it on top should be exposed

  • Kian-Tat Lim to ensure that Consolidated Database requirements include spatial indexing

If ADQL functionality is subsetted, the subset should be exposed through TAPRegExt; it's possible that a subquery-less service might have to be a different TAP endpoint