Why client/server

  • 10,000 astronomers with data rights will be on Google with Rubin accounts but will not be allowed to have accounts at SLAC on the USDF.
  • Client/server mediates access to the data at SLAC.
  • The server provides a barrier between 10,000 outside astronomers and the database.
  • User tokens on the RSP can be used to determine which datasets someone can access (see the sketch after this list).
  • In the future this could allow people to access the data directly on their laptops without the RSP.
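
As a rough illustration of the token-mediated access described above, here is a minimal sketch only; the endpoint path, the GROUP_ACCESS mapping, and the groups_for_token helper are all invented for illustration and are not the real Butler server API.

  from fastapi import FastAPI, Header, HTTPException

  app = FastAPI()

  # Invented mapping from user group to the collections that group may read.
  GROUP_ACCESS = {
      "g_data_rights": {"DP1/defaults"},
      "g_rubin_staff": {"DP1/defaults", "embargo/raw"},
  }

  def groups_for_token(token: str) -> set[str]:
      """Placeholder for a call to the RSP identity service."""
      raise NotImplementedError

  @app.get("/butler/dataset/{collection}/{dataset_id}")
  def get_dataset(collection: str, dataset_id: str,
                  authorization: str = Header(...)) -> dict:
      token = authorization.removeprefix("Bearer ")
      groups = groups_for_token(token)
      allowed = set().union(*(GROUP_ACCESS.get(g, set()) for g in groups))
      if collection not in allowed:
          raise HTTPException(status_code=403, detail="No access to this collection")
      # ... look up the dataset in the registry and return a signed URL ...
      return {"collection": collection, "id": dataset_id}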

Current Status

  • David Irving has joined the team and is making a fantastic contribution to Butler client/server.
  • David has already deployed a version to the RSP on Google that supports find_dataset and get() and can be used by the cutout service and datalinker.
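
In client code, the two supported operations look roughly like the following; the repo label, collection, and data ID values are placeholders rather than a real deployment.

  from lsst.daf.butler import Butler

  # Placeholder repo label and collection; a real deployment supplies its own.
  butler = Butler("dp02-remote", collections="2.2i/runs/DP0.2")

  # find_dataset resolves a single DatasetRef; get() fetches it via the server.
  ref = butler.find_dataset("calexp", instrument="LSSTCam-imSim",
                            visit=192350, detector=94)
  if ref is not None:
      calexp = butler.get(ref)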

General Assumptions

  • DRP Campaigns will never use client/server. Campaign processing is run by trusted users doing large graph builds where direct connection to the database is more efficient.
  • Client/server is not needed for summit operations.
  • ComCam commissioning data will not be made available to the data rights community until released as part of DP1.

Assumptions for DP1

  • DP1 (ComCam) will use client/server.
  • We will baseline a CloudSQL registry in Google with data at SLAC, but can fall back to putting the data on Google as for DP0.2.
  • DP1 will support all queries supported by direct Butler.
    • There may be a cap on the number of results that can be returned for any given query.
  • DP1 will not need to support Butler put().
  • There will be no group management. Everyone will be able to see all of DP1.
  • ObsCore will be implemented with a static table in Qserv as for DP0.2.

Prompt Processing Outputs (during survey operations)

  • Once the embargo period is over, raw data and prompt products must be made available to data rights holders.
  • The data and the registry will be hosted at SLAC.
  • Access must be mediated by client/server accessible from Google RSP.
  • This registry must include a live ObsCore table accessible by ObsTAP (example query after this list).
  • This is required on day 1 of survey start.
  • Data products are continually deleted from this repository in a rolling window, but raws must always be available.
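
As referenced above, "accessible by ObsTAP" means the live ObsCore table can be queried through the standard IVOA TAP interface. A minimal sketch with pyvo follows; the service URL and the dataproduct_subtype value are placeholders.

  import pyvo

  # Placeholder TAP endpoint; the real one would be the RSP ObsTAP service.
  tap = pyvo.dal.TAPService("https://data.example.org/api/obstap")
  results = tap.search(
      "SELECT TOP 10 obs_id, dataproduct_subtype, access_url "
      "FROM ivoa.ObsCore "
      "WHERE dataproduct_subtype = 'lsst.raw'"
  )
  for row in results:
      print(row["obs_id"], row["access_url"])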

Questions

Provocative

  • Do we need to support butler.put()? If we have a way for writes to be batched (e.g., from workspaces or user batch) do we need put()?
    • Could we provide a notebook-friendly butler.put() equivalent that writes to the local user file system without using a registry at all?
    • A "butler lite" that uses the file system to store the dataset refs and datastore records, with limited query support and not even SQLite (a minimal sketch follows).
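
A minimal sketch of what such a "butler lite" might look like, keeping dataset refs and datastore records as JSON files on disk with only linear-scan queries; the class and method names are invented for illustration.

  import json
  import uuid
  from pathlib import Path

  class ButlerLite:
      """Toy registry-less store: one JSON index file per dataset, no SQL at all."""

      def __init__(self, root: str):
          self.root = Path(root)
          (self.root / "index").mkdir(parents=True, exist_ok=True)

      def put(self, payload_path: str, dataset_type: str, data_id: dict) -> str:
          """Record a file already on local disk; returns an invented dataset ID."""
          dataset_id = str(uuid.uuid4())
          record = {"id": dataset_id, "dataset_type": dataset_type,
                    "data_id": data_id, "path": str(Path(payload_path).resolve())}
          (self.root / "index" / f"{dataset_id}.json").write_text(json.dumps(record))
          return dataset_id

      def query(self, dataset_type: str, **data_id) -> list[dict]:
          """Linear scan of the index; 'limited query support' in practice."""
          matches = []
          for f in (self.root / "index").glob("*.json"):
              rec = json.loads(f.read_text())
              if rec["dataset_type"] == dataset_type and \
                      all(rec["data_id"].get(k) == v for k, v in data_id.items()):
                  matches.append(rec)
          return matches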

At USDF

  • Do staff need to use Butler client/server when logged into a terminal at SLAC? Why?
  • We are not prioritizing graph building through client/server. This implies that bps submit would (initially at least) want to use direct Butler.
  • The staff RSP can use client/server, since large graph builds are not usually attempted from a notebook.

Prompt Products

  • Is there a requirement that data rights holders can write to the prompt products repo? 
    • How is that quota-ed?
    • When the "30 day" window expires for all the automatically-generated products, what happens to the user-generated products?
    • Are they supposed to transfer them to Google?
    • Is user-batch supported on this repo?
  • When PVIs etc are deleted, are the registry records retained (thereby keeping them visible in ObsTAP) or are they purged on delete as if they never existed?
  • Who is configuring the ObsCore system to determine which dataset types and collections are visible?

On Google:

  • For a data release is ObsCore static or does it need to be updated with user data products?
  • Do we need to cache butler.get() results on Google?
      • Do we put all the deep coadds on Google in permanent cache and only go to SLAC for per-visit datasets?
      • This is a much simpler implementation of caching that does not require proxies or cache expiration.
    • If the server is issuing a pre-signed URL to a file at SLAC, how does the dataset end up in a cache at Google? (See the sketch after this list.)
      • We do not want to pre-fetch into the cache server-side if the client does not end up using the signed URL.
      • We have heard that signed URLs from SLAC might not be supported at all. When would we know?
  • User data products:
    • When someone writes user products to the Google butler, where does the data go?
      • Presumably Google but might it go to SLAC to support user batch? (DI: Why would we want it to go to SLAC?)
        • Answer: We have no money for Google user-batch data volumes. Data must always be stored at SLAC.
      • How is it quota-ed?
      • Is u/username/ in the bucket quota-ed? How are quotas managed for collaborations writing to g/groupName?
      • Does it all go into a DRn user bucket? Or do all butler puts for a user go into a per-user / per-parent repo bucket?
    • If someone does a butler put to the DR2 repo and then wants to look at that data when they are connected to the DR5 repo, how does that work?
      • We aren't promising to keep the DR2 repo accessible when DR5 is released but do we keep the DR2 user outputs bucket around forever?
        • How do we link those files to a DR5 registry? Do we keep the DR2 registry around without the DR2 files?
      • Do we migrate everyone's user products every year? Losing provenance? On demand migrations? If DR2 is removed from Google do we delete the associated user data?
        • Do we give them a migration script and say they have one year to run it to transfer DRn-1 user products to DRn? 
        • Will any scientist really let us delete a data release?
          • Do we keep the DRn-2 registry and user products around but remove the DRn-2 data products?
    • Can we have a per-user butler repo instead?
      • People can "transfer" data of interest to their personal butler.
      • We can auto migrate the schemas of these personal butlers to keep in sync with software updates between releases.
      • Jim has mentioned the idea of "workspaces" where people transfer records from the DR repo to a local sqlite and do everything locally.
    • How would collaborations work if puts are not going to the DR registry?
    • Does the cutout service have to work with user products? How would it find them if they are not in the main registry?
  • User Batch:
    • Processing must run at USDF, but graph building must happen on Google if user data are in butler at Google.
      • User data products must be allowed to be inputs to the graph build.
    • Client/server may not be efficient for large graph builds so this likely will have to use a direct butler once the batch submission is approved.
    • User data then has to be transferred to SLAC if it is at Google.
    • If user batch products have to be at Google they have to be transferred back to Google from SLAC (along with the provenance information from the graph).
      • What if the volume exceeds the user quota?
      • Same question as above about where the user data ends up.
  • Is there an expectation that a single notebook can connect to a DR1 and DR3 butler (say) at the same time? (as opposed to a DR1 notebook talking to DR1 and DR3 notebook talking to DR3)
    • (DI: I was under the impression there is no such thing as a DRx notebook, and that there was just going to be a single supported version of the RSP.  Asked on #dm-square, hopefully someone will answer.)
      • Notebooks that worked fine on DR1 might not work on DR5 without edits due to API changes in pipelines code.
    • If yes, has DR1 server been updated to match any changes in DR3 client software?
      •  (DI: I think it has to be?  It's not plausible to just leave a server sitting there unpatched, we're going to need to do continuing upgrades for security/performance/deployment changes/etc.)
      • i.e., we can patch the server software that is accessing an older database, but still using the older butler in the server with older schema.
      • Presumably we will not be doing schema migrations for DR1 to forward migrate to DRn compatibility?  (DI: presumably many schema changes will be to fix performance problems encountered during operations – I don't see how we're going to be able to avoid upgrading schemas on older DRs to some extent.)
        • TJ: Related question: Are we planning to keep all DRn registries online even if the data disappears to tape?
    • Or does DR3 client software have to be able to talk to a DR1 server?
      • DR11 talking to DR1?
    • Does a DR11 notebook have to be able to read data files written by DR1?
      • Versioned formatters.
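
For reference on the pre-signed URL question flagged earlier in this list, the mechanism in question is roughly the following: the server signs a time-limited GET against the SLAC object store and hands the URL to the client, which then fetches the bytes directly. The endpoint, bucket, and key below are placeholders, and whether SLAC storage will accept such signatures at all is the open question.

  import boto3

  # Placeholder endpoint, bucket, and key.
  s3 = boto3.client("s3", endpoint_url="https://s3.example-usdf.slac.stanford.edu")
  url = s3.generate_presigned_url(
      "get_object",
      Params={"Bucket": "rubin-prompt-products", "Key": "raw/some/visit/file.fits"},
      ExpiresIn=3600,  # URL valid for one hour
  )
  # The Butler server returns `url`; the client downloads the file directly,
  # so the bulk data never passes through the server itself.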

Other

  • Who is leading user batch development? USDF or Data Abstraction?
  • Are we wanting to be able to "mint DOIs" for user butler collections and make them public? (Similar to what CADC can do with VOSpace directories).
    • Can DOE issue DOIs?

Internal

  • How many server nodes do we need?
    • How many database servers?
    • Are users pinned to a specific server if they start doing butler put? (DI: That "specific server" would always have to be the master – I don't think pg supports a replication mode where writes can be processed independently by a replica.) Does the put block until information about it has replicated to all databases? (DI: This has major availability problems – if one replica becomes slow, all puts are blocked.) We can't have a butler get fail immediately after a butler put if the get ends up using a different server.
  • Do we cap the number of results coming back from a query? If someone is getting a million datasets back, did they really mean that or was it a mistake? (A capping sketch follows this list.)
    • Are all queries async internally with workers doing the query itself?
    • Can we leverage some of the TAP async infrastructure to support this?
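
On the capping question above, a minimal sketch (all names invented) of a server-side cap that truncates the result set and tells the client it did so:

  MAX_RESULTS = 10_000  # illustrative cap, not a decided number

  def capped_query(run_query, limit: int = MAX_RESULTS) -> dict:
      """Run a registry query callable and truncate the results at `limit`."""
      rows = []
      truncated = False
      for row in run_query():
          if len(rows) >= limit:
              truncated = True
              break
          rows.append(row)
      # A truncated flag lets the client ask "did you really mean a million datasets?"
      return {"rows": rows, "truncated": truncated}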

Actions

  • Tim Jenness to discuss with the team whether we need to finish migrating FileDatastore so that it no longer directly accesses the database but uses records.
  • Tim Jenness to discuss with Fabio Hernandez whether the default for tmpdir in HTTPResourcePath can be changed to tempfile.gettempdir() rather than os.getcwd().
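
For context on the second action, the proposed change moves temporary downloads from the process working directory to the system temporary area:

  import os
  import tempfile

  os.getcwd()            # current default: whatever directory the process started in
  tempfile.gettempdir()  # proposed default: e.g. /tmp, or $TMPDIR when set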

The conversation Dave wants to have

They've only scheduled one hour for this breakout. The main things I urgently want to discuss with these people are, in priority order:

  • Is it acceptable to not write user-generated data into the same Postgres DB as the data release itself?  Many difficult technical, operational and political problems become more tractable without this requirement.

^^^^ if we discuss nothing but this and are able to get to a "yes" I will consider the trip a success.  Beyond that:

  • What is the story for how the end-user facing Butler will interact with prompt data products?
  • Is it understood that Rubin staff will continue using DirectButler for many tasks?
  • For the end-user facing RSP, we may need to host some Butler database replicas at SLAC, and potentially the Butler server instances to go with them as well. Are DBA resources available to support this, and what is the process for requesting this infrastructure?
  • Relationship between Butler query system and RSP portal

2 Comments

  1. There are suggestions that it may be difficult to issue pre-signed URLs pointing directly to SLAC/USDF storage.  (But there are at least two ways that still seem viable for making this work, even if a third preferred way ends up not being viable.)  Two other alternatives would be to issue pre-signed URLs pointing to a Google/DAC-hosted service that proxies pass-through access to SLAC/USDF or to actually copy the data from SLAC/USDF to a Google/DAC GCS cache, then issue a pre-signed URL pointing to the cache.  Are both of these compatible with Butler client/server?

    1. The proxy service concept should be viable if that becomes necessary.  Always copying the files is more problematic – there are use cases (like datalinker) that assume that it is fast and cheap to obtain a link to a file.
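
For the proxy alternative mentioned above, a minimal sketch of what a Google/DAC-hosted pass-through might look like, streaming bytes from SLAC storage without landing them on disk; every name and URL here is a placeholder, and authentication/signing of the proxy URL, error handling, and connection cleanup are omitted.

  import httpx
  from fastapi import FastAPI
  from fastapi.responses import StreamingResponse

  app = FastAPI()

  # Placeholder internal endpoint for SLAC/USDF object storage.
  SLAC_STORE = "https://s3.example-usdf.slac.stanford.edu"

  @app.get("/proxy/{bucket}/{key:path}")
  async def proxy(bucket: str, key: str) -> StreamingResponse:
      """Stream a SLAC-hosted object through the DAC to the client."""
      client = httpx.AsyncClient()
      upstream = await client.send(
          client.build_request("GET", f"{SLAC_STORE}/{bucket}/{key}"), stream=True
      )
      return StreamingResponse(upstream.aiter_bytes(),
                               media_type=upstream.headers.get("content-type"))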