Major blockers are multi-user registries and repo stability: these seem to form the irreducible core of Jim Bosch's work.
Can we use a “friendly user” shared schema to avoid the multi-user issue? (A rough sketch of this model appears after this list.)
High risk of data loss, etc.
But could do ad-hoc backups, etc.
And it's not clear how this could work in DACs.
Jim thinks that the friendly user mode could still be a big advantage, in particular to free up his time before the Algorithms Workshop, although we would ultimately need to sort out the multi-user registry.
Does Jim really need to be the person working on multi-user registries? Surely this just needs a database expert?
Risk of delegation is that new developers would have unpredictable velocities.
Try to separate the issues of design and implementation, and have Jim focus only on the former.
Need to develop a technical plan to deploy a system without a multi-user registry; would still be significant work.
There are multiple DMLT pundits suggesting how this could be done.
Do users have to share the same registry as production?
If they don't, there's a lot of duplication.
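As a rough illustration of the “friendly user” model discussed above (the repository path and collection name below are invented, and the evolving Gen3 Butler API is only sketched):

```python
# Sketch only: a "friendly user" shared registry, in which all trusted
# users open the same Gen3 repository and hence share one registry DB.
from lsst.daf.butler import Butler

butler = Butler(
    "/datasets/shared-repo",       # hypothetical shared repository root
    run="u/jbosch/friendly-test",  # per-user output collection to limit collisions
)

# Anything written here is immediately visible to every other user: there
# is no per-user isolation, which is why the data-loss risk and ad-hoc
# backups are flagged above.
```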
Fritz Mueller — consider mechanisms by which Robert Lupton and other science users can provide feedback on user-facing middleware issues to the development team.
Jim Bosch — identify a design which enables hand-off of middleware implementation to alternative developers ASAP, probably including “friendly user” shared registries.
Note that converting Science Pipelines code to PipelineTasks is not the same as working on middleware development itself.
Concerns about addressing batch processing workflows through notebooks, interface with CAOM model and IVOA services, virtual data products, etc. — when can the middleware handle this? At what level is it an overriding priority for commissioning?
Leanne Guy — identify a product owner for middleware (or, possibly, a science owner and a technical owner). They should not be (a) member(s) of the development team.
Fritz Mueller — arrange a DMLT-level demo of the current state of the middleware.
Wil O'Mullane — identify the long-term management structure (as opposed to product owner) for middleware.
Given the discussion on datasets (above) and the ongoing and emergent requirements of the construction and commissioning projects, we should understand the current capabilities and future schedule for reprocessing efforts at NCSA. In particular:
Following the departure of Hsin-Fang, what staff are assigned to this?
How ready is the NCSA team to support regular test processing as requested by the SST (above)?
How are issues reported and acted upon?
Is tracking metrics with SQuaSH adequate?
Do we also need Jira tickets?
Who is responsible for filing/triaging/resolving issues?
What work needs to be done, either by NCSA or by other teams, to make this process as automatic as possible?
Can we read this last item as “what is the current status of the Batch Production Service”?
Each dataset being processed should have an owner (e.g. Yusra for RC2/DC2 data processing) who agrees plans for processing with the LDF team.
There should be some default agreed configuration for processing, and a process by which the owners can change that configuration on demand.
The expectation is that the LDF team will ultimately become familiar with the warnings issued by the pipeline and filter out the unimportant ones. In the short term, though, they may escalate an elevated number of warnings.
Compare the number of output files with that from the previous run (a toy sketch of such checks follows this list).
Look for errors in the logs.
There is a generic problem that the Pipelines are poor at capturing errors in a coherent way.
This will be easier in Gen 3 (we're assured).
Where appropriate, success can be seen because relevant values are stored in SQuaSH.
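A toy sketch of the success checks above (no such tool exists yet; the manifest files and log format are invented for illustration):

```python
# Compare output counts against the previous run and scan logs for errors.
import re

def count_outputs(manifest_path):
    """Count the output files recorded in a run's manifest."""
    with open(manifest_path) as f:
        return sum(1 for line in f if line.strip())

def check_run(current_manifest, previous_manifest, log_path, tolerance=0.01):
    current = count_outputs(current_manifest)
    previous = count_outputs(previous_manifest)
    if previous and abs(current - previous) / previous > tolerance:
        print(f"WARNING: output count changed from {previous} to {current}")
    with open(log_path) as f:
        errors = [line for line in f if re.search(r"\bERROR\b", line)]
    if errors:
        print(f"WARNING: {len(errors)} error lines found in the log")
    return not errors
```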
(In so far as it is not covered by the item above), please provide an overview of the current design and implementation of the various services which require workflow management.
In particular, following surprise/confusion at the June 2019 DMLT F2F, we should establish whether tools like Pegasus are required:
For the Prompt or Batch Production Services;
For user-triggered processing from the Science Platform;
To respond to the recommendation from the Directors' Review that “more effort should be spent on refining diagnosis and recovery from processing errors as this will be critical for operating at scale.”
The design will be documented in DMTN-123 when it's ready.
Is there a clear division of responsibility in terms of error reporting between PipelineTasks and the workflow system?
In general, PipelineTasks should be atomic: either they throw, or they complete (see the sketch after this list).
There is broad agreement in the room that Pegasus is not a requirement at this stage, although we note that it could be re-added to the system at the appropriate time — compatibility will be maintained.
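A minimal sketch of the throw-or-complete convention (the task, config, and helper names are invented; only the lsst.pipe.base and lsst.pex.config scaffolding is real):

```python
# An "atomic" task either returns a complete result Struct or raises,
# giving the workflow system an unambiguous success/failure signal.
import lsst.pex.config as pexConfig
import lsst.pipe.base as pipeBase


class ExampleConfig(pexConfig.Config):
    """Empty config for the toy task."""


class ExampleAtomicTask(pipeBase.Task):
    ConfigClass = ExampleConfig
    _DefaultName = "exampleAtomic"

    def run(self, exposure):
        output = self._do_processing(exposure)  # hypothetical science step
        if output is None:
            # No partial results: fail loudly so the workflow layer can
            # record the error and schedule recovery.
            raise pipeBase.TaskError("processing produced no output")
        return pipeBase.Struct(output=output)

    def _do_processing(self, exposure):
        # Placeholder for the real algorithm.
        return exposure
```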
While a basic outline has been set, the devil is in the details; these need to be discussed, as they will determine the timeframe in which this rehearsal can occur.
We also need to discuss whether or not verification tests should be combined with the rehearsal.
Expect OR#2 to run for ~1 week.
Timescale under debate, depending on what instrumentation is available and what is actually required.
The primary aim is to exercise people, not hardware.
When ComCam is on the summit, all data should be transported to NCSA.
Discussion about whether the ops rehearsals should focus on hardware and service delivery and integration events, or on training the ops team.
We should engage with the SIT-COM team to determine how they can be involved in using the Ops Rehearsals to demonstrate successful functioning of the observatory, rather than just successful functioning of the operations team. However, this may be part of OR#3, rather than #2.
OR#2 will be based on AuxTel, in February; processing will be based on the summit.
Robert Gruendl is empowered to ask other people for help in fleshing out the documentation for the OR.
We are already at the stage that we are deploying real, operational services; over the next few years, both the number of capabilities being deployed and the number of users will increase substantially.
We have paid lip-service to the idea of configuration management, but we've not taken many concrete actions.
How do we arrive at a system that satisfies our need for rapid development and deployment while providing an adequate level of configuration control?
“Configuration management lets us know what we have; configuration control lets us know when it changes”.
The DMLT seems happy with the above technology stack, but it's not clear at what level the DM-CCB (or some other body) will actually sign off on which changes.
Discussion of whether a “canary” model can apply to DM services; it's not clear we have enough users to make this worthwhile.
Perhaps this depends which of the various DM services we care about.
Assertion that the DMLT “doesn't care” about control of stack containers going to the LSP; all that matters is the containers defining the JupyterLab service.
Frossie is skeptical that the CCB-as-gatekeeper would add to the process that she already goes through; Wil reckons that the point is a sanity check on Frossie's decision.
Worry that LSP developers are feeling pressure to support services when they break; more configuration control might help with this. Frossie suggests the SQR-035 model will help address this.
Conclusion is that Frossie Economou will remain in her current role of “gatekeeper”, with no DM-CCB or other direct oversight, until the SQR-035 plan has been fully implemented. At that point, we should review.
And that there should be a more rigorous system for managing access to commissioning data, possibly involving input from Bob Blum.
Wil O'Mullane — ensure that a standardized deployment mechanism is documented and required across subsystems.
Frossie Economou — develop configuration management systems based on SQR-035, and report on progress at the next DMLT F2F meeting.
Kian-Tat Lim — ensure that A&A systems are managed following the SQR-035 plan.
We've been hearing for a long time about plans to move third party packages to Conda, to adopt a Conda-based toolchain, etc. Let's have a summary of the work which has been performed to date, and a summary of and timeline for future plans.
All LSST patches to third-party packages, except those for pytest-flake8 and eigen, are no longer necessary.
This proposal would require LSST to set up and host in perpetuity a web-facing Conda channel.
Conda development patterns: “we do not (yet) understand what it would do to people's everyday lives”.
When can we switch all the third parties to Conda packages? As soon as the scipipe_conda_env becomes an EUPS package. There is currently no timescale.
It is not clear which group would be responsible for maintaining a Conda channel (and, indeed, SQuaRE would like to get out from under supporting stack builds in general).
Wil O'Mullane — clarify who is responsible for developing and maintaining services in support of stack build and deployment.
Status update on the work to produce revised (and simplified!) sizing and cost models.
New model is not yet done, but is ready for a status update.
No compute is currently reserved for staff (i.e., for ad hoc QA, etc.).
Currently assumes 2 months of LSSTCam in FY22, then full operations from FY23 onwards.
Consider making the “additional DRP steps” parameter into separate per-visit and per-object steps (see the toy example after this list).
Model does not currently include daytime solar system processing; zeroth-order assumption is that daytime processing can simply use the (idle) AP infrastructure (but these have not been shown to match up).
Model assumes full AP in LOY1. But this is a relatively small part of the compute budget.
Total spend to end of FY23 is around double the initial estimate of $14M.
BUT we should not get hung up on these numbers for now — there are still huge uncertainties in this process, which need to be resolved quickly.
This model does not yet account for two data release productions in LOY1.
And should account for 10% of storage for users (as well as 10% of compute).
And does not yet account for IN2P3.
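To illustrate the suggested per-visit/per-object split (all numbers below are invented and bear no relation to the actual model):

```python
# Toy version of splitting the "additional DRP steps" compute parameter
# into separate per-visit and per-object terms.
n_visits = 2_000_000         # hypothetical visit count for a data release
n_objects = 20_000_000_000   # hypothetical object count
cpu_hours_per_visit = 50.0   # invented
cpu_hours_per_object = 1e-5  # invented

extra_drp = n_visits * cpu_hours_per_visit + n_objects * cpu_hours_per_object
print(f"Additional DRP compute: {extra_drp:.3g} CPU-hours")
```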
DMTN-135 is not yet complete, but it is ready for comments on the text from DMLT members.
Leanne Guy & the DM-SST — Compare contents of LSE-81/82 (science inputs to sizing) with results from the HSC processing (NB this is also a risk mitigation). Ticket DM-22082
Who will be working on middleware during 2020? How can we free up folks — in particular Jim Bosch — to focus on other tasks? Fritz Mueller has agreed to come up with a plan which he will discuss at this meeting.
Status update and timeline for SDM standardization
The current monthly reprocessing of HSC data is still a very manual process run by Hsin-Fang. As we move towards the end of construction, running these re-processings more frequently and on different datasets is essential to understand the performance of the pipelines. This will not happen unless we automate the process.
For the last few years, Hsin-Fang has provided an invaluable service to the DRP team by regularly reprocessing the HSC RC2 dataset every few weeks (initially fortnightly, currently monthly) and reporting issues.
As Hsin-Fang has moved on from NCSA, and as we move closer to commissioning/science validation, we should review whether this is still the most effective way to proceed. Specifically:
How much of a resource drain is this on NCSA?
Can it be automated (see also the discussion topic above, and the hypothetical sketch below)?
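A hypothetical sketch of what such automation might look like, wrapping whatever driver the LDF team uses in a scheduled job (the script name, arguments, and run-naming convention are all invented):

```python
# Scheduled entry point that launches the recurring RC2 reprocessing.
import datetime
import subprocess

def launch_rc2_reprocessing():
    # e.g. "RC2/w_2019_46" as a weekly run label
    run_name = f"RC2/w_{datetime.date.today():%Y_%W}"
    subprocess.run(
        ["run_rc2.sh", "--output", run_name],  # hypothetical driver script
        check=True,  # fail loudly so the scheduler can alert a human
    )

if __name__ == "__main__":
    launch_rc2_reprocessing()
```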
We know a traditional RDBMS has been investigated, and there is work ongoing with Cassandra, but what's the current status? When do we expect this to converge? What's the risk that we will simply be unable to hit performance targets?