Last report on Gen3 middleware status was demo mid-December, 2019. In the ensuing couple of months, management of these efforts has transitioned from Fritz Mueller to Tim Jenness.
Status and forward trajectory update.
Extensive discussion of what “Gen3 feature parity” means.
Effectively, it is possible to move all processing jobs that we currently use Gen2 for to Gen3.
It is not necessary that all tasks be converted, or that the registry schema be stable.
Demonstrated by e.g. running a DRP processing, AP processing, etc.
Need to define intermediate goals and milestones for middleware development, e.g. support for OCPS.
Support from Project Management (i.e., Wil O'Mullane): Tim should tell us what/who he needs, and the project will support that.
Scope was to understand use cases and make suggestions for potential tooling.
Not to design “one overall image display system”.
Yusra provided the meeting with a summary of the results; the interested reader is referred to DMTN-126.
Invitation for DM team to take part in an Astrowidgets workshop.
Chris Waters is currently using Ginga/Astrowidgets extensively.
Simon Krughoff has been in regular contact with Eric Mandel, JS9 developer; he has been very responsive.
Mandel has provided a Docker image, which may make it easier to deploy JS9 in the browser.
There is a JS9/Electron desktop app that could potentially replace DS9 if we want to unify the user experience.
Discussed the possibility of including Ginga and/or JS9 in the Nublado environment.
The DM-PORTAL milestone in 2021 is to restart portal development. The portal is perhaps a little different from this (Firefly blends the two); we should discuss.
If we restart portal development at the end of 2021, should we do a technology survey early in 2021, possibly in conjunction with EPO? When DM-PORTAL was added, no prior technology-exploration package (or milestone) was included.
Wil O'Mullane requests a “bake off” between Ginga and JS9 in the LSP JupyterLab environment.
This means making these tools available to observers and seeing what they like.
As well as a technical evaluation from the LSP team as to what they can support.
Wil regards this as high priority, but not as high as the EFD.
Tension between “notebook CI” and this work means it is unlikely to be available to observers until late Spring.
Will get a date for that on Thursday.
This bakeoff is not the same as the portal decision.
Yusra AlSayyad/ John Swinbank — capture feedback from observers about their interactions with Firefly (this might go in the WG report, or it might be a separate document).
Frossie Economou — get JS9 and Ginga integrated into the LSP JupyterLab environment (date TBC).
What is the model for managing calibration products during the operational era?
For example, are calibration products versioned through git repositories (as they are during construction)? Are they exclusively managed through the Butler? At what points can data be “ingested” to a data repository?
My understanding is that the middleware and calibration products teams have built an impressive toolbox of technologies that can be used to implement whatever data management policy we want, but that nobody has yet written down what that policy should be... and many people have different, incompatible, implicit policies in their heads.
Raw calibration data is not included here; all raw data is kept together in Butler repos in the Data Backbone regardless of purpose.
The logic to choose which master calibration products to use when processing data is currently undefined.
We need to define the available technical solutions, but even more we need to define processes and procedures, not just the technical design.
Would a more active product owner or an assistant help with pushing this definition, without needing a working group?
The next Ops Rehearsal includes handling master calibrations as an explicit component.
Middleware has been defining one strategy via RFC on obs_lsst_data for pipeline-generated products that are converted to human-readable forms for acceptance and curation in a git repo. Simon and Tim are working on a DMTN describing this.
Tim Jenness — Extend the DMTN to solve whole problem or at least describe the future questions to be answered; leverage/delegate to Chris Waters to avoid overload.
This proposal attempts to alter the means by which image creation and forwarding would occur, replacing elements central to the overall data management and prompt processing systems.
At the moment we write two FITS files out for (almost) every observation: one from the CCS, one from DM code. The proposal is effectively to eliminate the latter.
The proposal would have some impact on the Tony Johnson / Camera Team schedule for the CCS Image Writer; we would provide some support.
How will we use the channels freed by not running the DAQ client over DWDM?
Are we still transferring data twice, or can prompt processing data be written to the DBB?
Should not be necessary to do this twice given the lack of crosstalk correction.
But the data representation might not be ideal for DBB.
There will be a copy at the base for the OODS.
System could be evolved after it is up and running.
Agreed that duplicate transfers would be “non ideal”.
Is there a potential for another catchup buffer in this architecture?
No; the camera is not maintaining a buffer of images.
Catchup is pulling data from the DAQ, and reconstructing a FITS file? Is it using the same code as was used to write the data to start with?
Not clear yet; still has to be investigated.
Catchup should definitely be based on CCS Image Writer, rather than Forwarder, code.
Alert science in early operations would be enhanced by incremental template generation prior to DR1.
How much effort would be required of the construction project?
Do we have estimates of the operations effort to run it during LOY1?
Incremental: what does it mean?
We make a template once in year one, and then we don't modify it after it has been made. We don't keep adding to existing templates.
We are aware of coverage & overlap issues here.
How many images are needed per filter to make a template?
3 is the number Eric likes, but there is some ongoing discussion.
3 is consistent with requirements on image noise.
Do not expect any form of DCR correction in year 1.
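The “3 images per filter” figure can be tied to the noise requirement with simple arithmetic. A minimal sketch (illustrative only; the function name and the equal-depth, independent-exposure assumptions are ours, not from the requirements flow-down):

```python
import math

def coadd_depth_gain_mag(n_exposures: int) -> float:
    """Depth gain (mag) from coadding n equal, independent exposures.

    Per-pixel noise falls as 1/sqrt(n), so the limiting magnitude
    deepens by 2.5 * log10(sqrt(n)) = 1.25 * log10(n).
    """
    return 1.25 * math.log10(n_exposures)

# With the 3 images per filter discussed above, the template is
# roughly 0.6 mag deeper than a single visit.
print(f"{coadd_depth_gain_mag(3):.2f} mag")
```

Diminishing returns set in quickly (a fourth image adds only ~0.15 mag), which is consistent with 3 being an attractive operating point.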
No disagreement with Eric's estimate of pipeline & workflow development.
Ops plans are not yet clear enough to speak directly to Eric's plans for Execution and QA, but they sound plausible.
Computing impact:
Absent templates, what happens when images arrive at the LDF (assuming no alert production)?
Enough single frame processing to return telemetry to the scheduler.
Template generation would be run at end-of-night.
Eric says prompt-processed-style PVIs would be sufficient for incremental template generation; don't need the more elaborate DRP system.
How do we manage the impacts of some data being available to users before project-provided data products? How do we prevent our own users from scooping us? What products are we producing? How do we prevent everybody trying to use the LSP to access the data and do their own reductions?
Some of this can be controlled with throttling.
Agreed to return to this topic at a future meeting.
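The throttling mentioned above could be as simple as per-user rate limiting. A minimal token-bucket sketch (hypothetical; a real deployment would enforce this at the API gateway or LSP service layer, not in application code):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: allow `rate` requests per
    second with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens accrued since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Eight rapid-fire requests against a bucket of capacity 5:
bucket = TokenBucket(rate=2.0, capacity=5.0)
results = [bucket.allow() for _ in range(8)]
print(results)  # the burst of 5 succeeds; the remaining 3 are throttled
```

The capacity/rate split lets legitimate interactive use burst while sustained bulk scraping is held to the steady rate.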
We will submit an LCR expanding the construction scope as proposed by Eric.
Then it will be Bob's call as to how the Operations team reacts.
Wil O'Mullane — schedule a discussion about rolling out data products and capabilities to users without having them scoop the project or swamp our resources.
Eric Bellm — submit an LCR describing changes to the construction plan to enable incremental template generation.
There are many DPDD update tickets; need to prioritise getting them done.
Speed of development for Pipelines vs. DPDD changes is very different; impedance mismatch.
It is not necessary that the DPDD list everything described in the SDM; it's also possible to queue up DPDD updates on master rather than baselining them as they arrive.
What is the “missing link” between the SQL schema and consumers (Qserv, etc)? Is it Felis?
Who is maintaining Felis since BVan left DAX?
Suggestion that a testing framework is necessary.
Hsin-Fang may have the best sense of what is the next most useful utility to be added to the Felis toolkit, and she would be in the best place to make this happen – consensus that Hsin-Fang will be the Felis maintainer.
Changes to BaselineSchema.yaml should be change controlled.
Need to write a technote on what the schema is, where it's used, where it's going, etc. Some tension between providing enough visibility into what's happening without overly constraining or overloading the people who are doing the work. Agreed that Wil would do this as a compromise.
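As a concrete (hypothetical) illustration of what a change-control review of BaselineSchema.yaml could check, here is a sketch that diffs two parsed versions of the schema. It operates on already-parsed dicts to stay dependency-free; in practice one would load the YAML (e.g. via Felis), and the table/column names below are invented:

```python
def diff_schema(old: dict, new: dict) -> dict:
    """Report column-level changes between two parsed schema versions.

    `old` and `new` map table name -> list of column names, i.e. the
    shape one would get after loading BaselineSchema.yaml with a YAML
    parser. The output is the kind of summary a change-control review
    needs to see.
    """
    changes = {}
    for table in sorted(set(old) | set(new)):
        before = set(old.get(table, []))
        after = set(new.get(table, []))
        added, removed = sorted(after - before), sorted(before - after)
        if added or removed:
            changes[table] = {"added": added, "removed": removed}
    return changes

old = {"Object": ["objectId", "ra", "decl"], "Source": ["sourceId"]}
new = {"Object": ["objectId", "ra", "decl", "psFlux"],
       "DiaSource": ["diaSourceId"]}
print(diff_schema(old, new))
```

Such a diff, attached automatically to each pull request, would give reviewers the visibility discussed above without adding manual overhead.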
Leanne Guy — produce a plan for interaction between the DPDD and the concrete SDM schema.
DM-23818
Fritz Mueller — find somebody to update the online schema browser. (Igor Gaponenko assigned)
Kian-Tat Lim — arrange for the schema browser to be removed, until & unless the action to update it comes true.
Colin Slater — ensure change control policy for BaselineSchema.yaml is documented. In progress on DM-23614.
Wil O'Mullane — write a technote describing his understanding of schema management. DM-23658.
We note that providing a service backed by Parquet files is just one possible use of Parquet.
Refined scope for this session: do we store the data that we make available in Qserv in Parquet files?
The DAX team view Qserv partitioning as an internal tuning parameter, rather than something that should be exposed through public data products.
Moved for a hierarchical representation, e.g. HEALPix, independent of either the Pipelines or DAX representation.
We already use HTM for reference catalogs.
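The partition-key idea can be sketched in a few lines. This pure-Python stand-in bins sky coordinates into coarse ra/dec cells purely for illustration; a real implementation would use a proper hierarchical scheme (e.g. `healpy.ang2pix` for HEALPix, or HTM as with the reference catalogs) and use the pixel index as the Parquet partition key:

```python
def spatial_bucket(ra_deg: float, dec_deg: float,
                   cells_per_side: int = 8) -> int:
    """Map sky coordinates to a coarse cell index.

    Stand-in for HEALPix/HTM: rows sharing a bucket would land in the
    same bulk-download file, independent of Qserv's internal
    partitioning (which stays an internal tuning parameter).
    """
    i = int((ra_deg % 360.0) / 360.0 * cells_per_side)
    j = int((dec_deg + 90.0) / 180.0 * cells_per_side)
    j = min(j, cells_per_side - 1)  # dec = +90 lands in the top row
    return j * cells_per_side + i

# Nearby sources share a bucket; distant ones do not.
catalog = [(10.7, -42.1), (10.9, -42.0), (250.3, 31.6)]
buckets = {coord: spatial_bucket(*coord) for coord in catalog}
```

The point of the sketch is only that the public file layout is keyed on a sky-standard scheme, decoupling it from any internal database layout.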
Worried about making a one-size-fits-all approach to download — likely need both filesystem and object storage.
Also should consider a CDN.
Note that IRSA userbase very much wants bulk download, and almost all catalogs are available in this way.
Some concerns about agency views on data rights.
We recognize that this is a likely upscope, which we should identify as such.
We should not refer to this as a “bulk download service”.
Robert Gruendl — prepare a technote defining the meaning of “bulk download”.
Unknown User (mbutler) & Gregory Dubois-Felsmann — identify existing requirements, or suggest new requirements, for a user-facing “bulk-download” service (but not under that name).
Full-bandwidth testing is pending availability of the forwarders; these are not currently being procured due to uncertainty over the post-crosstalk-descope data acquisition design.
Do not regard this lack of testing as a major risk.
LSST Security Summit is coming up in April. Agenda unknown (until then, talk of encryption is just speculation).
Query whether there should be a full VNOC at the summit, given that it is likely to be staffed during the night.
Query as to whether international partners need VNOCs.
VNOC is a small set of servers, directly measuring aspects of network performance (dropped packets, etc), and providing a facility to document network events, together with a transmission of that information to a central collecting point, which then publishes to web portals.
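One of the measurements described above, packet loss, can be illustrated with a toy parser for `ping` summary output (hypothetical; real VNOC deployments use dedicated probe software such as perfSONAR, not ping):

```python
import re

def packet_loss_pct(ping_output: str) -> float:
    """Extract the packet-loss percentage from a `ping` summary line.

    A toy version of the kind of metric a VNOC node would record and
    forward to the central collecting point.
    """
    match = re.search(r"([\d.]+)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found")
    return float(match.group(1))

sample = "10 packets transmitted, 9 received, 10.0% packet loss, time 9012ms"
print(packet_loss_pct(sample))  # 10.0
```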
This is the same test fixture as was used for the SQL system evaluations.
Current hardware provides 1–3 months of experimentation; then another couple of months of cloud experimentation; should have a costing on the Cassandra system sometime in the summer.
Should report on this at the next DMLT.
Use caution when comparing absolute values between the SQL and Cassandra results presented.
The DAX group will push the Cassandra investigation as far as they can, but will jump to a custom solution if they find it to not be viable.
Fritz Mueller — report on progress on Cassandra / APDB to the DMLT. (Report deferred to next DMLT due to reassignment of Andy S to middleware task.)
Brief discussion of the plan for Ops Rehearsal #2, which is coming up soon.
Longer term discussion. What are our future operations rehearsals? Are they being scheduled to reflect particular hardware deliveries or other capabilities, or based on the calendar? Are we really treating them as “operations rehearsals”, or are we misusing this word to mean “integration exercise”?
We should be clear that making data available “through the LSP” means more than just having it accessible on a filesystem through a Butler.
Expectation is the rehearsal terminates after running pipelines and simple QA; no data being made available for community inspection.
Note that “prompt processing” in these slides is in scare quotes for a reason — it is not LDM-148 Prompt Processing Service processing, but just data processing that takes place soon after data has been acquired.
Kubernetes cluster at the base is about a week away.
Keen to run what verification we can during the ops rehearsals.
Some consensus on moving operations rehearsals away from hardware delivery dates, not least because hardware that becomes available will almost certainly be pressed into use immediately.
Only hard part in terms of Gen3 middleware is making data incrementally available.
I.e., incremental visits arriving, as contrasted with a complete data release.
John Swinbank would be a good point of contact for information on and coordination of pipelines activities.
Wil O'Mullane (with Bob Blum) — coordinate schedule for Ops Rehearsal #2 with the LATISS team to make sure that we aren't disrupting LATISS engineering work.
This discussion is in part a response to discussions that arose at the AAS meeting around access to Rubin Obs. data.
Can we make a specific statement acknowledging the challenges involved in providing public access to Rubin data?
Even coming up with a plan here is outside our formal scope, and it's clearly not a day-one problem for Operations.
Should the DMLT be doing anything here, even though we care?
Broadly: no, although we shouldn't do anything that'll make it harder to solve this problem in future.
Wil O'Mullane — write a paragraph for the SAC describing the DMLT's professional opinion on how we might make old data releases available in operations, should we be asked to do so. Done ... DMTN-144
Tests have been performed on satellite trail rejection.
The uncertainty around Tony Tyson's claim that we may lose 30% of images is twofold: it is not clear how different future satellite constellations will look from precursor data, and we have some technical concerns with some of the analysis performed to date.
HSC has a narrow field of view and a relatively small survey time allocation; it has just been lucky not to have seen any trails so far.
What is the model for managing data and data products during the operational era?
We should develop and advertise a clearer plan for how non-Data Rights holders can access data release(s) that are no longer proprietary. Bulk access through a cloud host? Unauthenticated API or Portal access? Something else? More if they pay?
Status of the drill-down tool for analysis of pipeline outputs.
Alert science in early operations would be enhanced by incremental template generation prior to DR1. This would be new scope: how much effort would be required of the construction project, and do we have estimates of the operations effort to run it during LOY1?
How are we scheduling future operations rehearsals? Are they tied to particular hardware deliveries / system capabilities being available, or are they purely time based? What can we nail down now, to enable them to be used in planning V&V activities?
Outline for second rehearsal is in current LDM-643.
Firefly has been used in support of early LATISS operations, and has thrown up some problems. What is DM's response? Should consider future Firefly development plans, the report of the Image Display WG (DMTN-126), and the possibility of including Ginga and/or JS9 in the Nublado environment.
Current proposal is being discussed here. This proposal attempts to alter the means by which image creation and forwarding would occur, replacing elements central to the overall data management and prompt processing systems.
Cassandra has been chosen for evaluation as a potential platform for implementing the APDB, and hardware has been procured and deployed at NCSA to support this evaluation. Report on progress of this effort and possibly early findings.
1 Comment
Wil O'Mullane
I see I am moderator and leader of part of the session (image display); I thought the moderator should not be running the session.