2022 October 18 and 19
This meeting will be virtual.
Join Zoom Meeting (DMLT link)
Meeting ID: 934 1206 5536
All technical sessions of DMLT vF2F will be open to DM members. Should there be a need for a closed session, it will be marked "DMLT only".
Day 1, Tuesday
|Time (Project)||Topic||Coordinator||Pre-meeting notes||Running notes|
Moderator: Wil O'Mullane
Notetaker: Frossie Economou
|09:00||Welcome - Project news and updates|
Project news etc
- reminder to onboard new hires
- Future DMLTs
|09:30 (DMLT only)||BPS plugin support/Processing USDF/Summit|
Decide whether we have an opinion on summit batch and developer-driven processing (parsl vs condor vs panda). We are happy to support 2 bps plugins (probably not 3).
- do we want to support these plugins outside USDF/Summit? [jb]
- discussion of support model, esp. re Summit [CS/FE]; support for the RSP model - clock is ticking
- after bake-off winner goes on summit
- parsl over k8s?
|10:00||User Generated Data Catalogs||Report on a general design and pieces required for support of user generated data products in the RSP, so work planning can begin.|
User-generated data products
- for temporary tables user only interacts with the TAP server [gpdf]
Moderator: Frossie Economou
Notetaker: Kian-Tat Lim
|11:00||Commissioning Cluster, Yagan, and do we have what we need?||Primary objective: Do we need to buy more nodes to increase the cluster's cores and RAM? Secondary objective: What do we need to run in Yagan?|
RHL: Initial spec was 2 cores per CCD (400 cores).
Right now not much being used out of 640 cores.
New nodes will arrive 7 months after order.
What is running where now?
Most heavy compute like full focal-plane wavefront will be done at USDF, but rapid analysis will be done in this cluster.
Power, cooling and space concerns? Far from the space limit, but close to the limit on the number of power sockets; a new UPS will be installed soon, so we will have the ability to support more; limit is at least 1500 cores.
Anything that happens at the Summit must happen at USDF (or be transported there) as well in order to satisfy users who will want the same data products.
Can we estimate based on the number of parallel pipelines running in order to provide feedback to operators in near-realtime? Maybe 2-4 such? (Therefore 400-800 cores.) FAFF is defining these; they may be different than what is in Alert Production (e.g. PSF estimation).
Must distinguish between metrics that "must be there" (or will waste telescope time) and "mostly needs to be there" (can come from Alert Production).
USDF resources are also not particularly well-motivated.
Jira is also an issue for Summit independence.
Maybe start with an extra 400 to be safe, then adjust when we see what happens when ComCam goes on sky.
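The core-count reasoning above can be restated as a quick back-of-the-envelope check. This is only a sketch: the figure of 200 cores per pipeline is an assumption inferred from the "2-4 pipelines, therefore 400-800 cores" estimate, and the names used here are hypothetical.

```python
# Back-of-the-envelope check of the core counts quoted in the notes.
CURRENT_CORES = 640        # cores in the cluster today
EXTRA_CORES = 400          # proposed initial purchase
SITE_LIMIT_CORES = 1500    # quoted as "at least 1500" in the power discussion
CORES_PER_PIPELINE = 200   # ASSUMPTION: inferred from 2-4 pipelines -> 400-800 cores


def cores_needed(n_pipelines: int) -> int:
    """Cores required to run n parallel near-realtime pipelines."""
    return n_pipelines * CORES_PER_PIPELINE


expanded = CURRENT_CORES + EXTRA_CORES
for n in (2, 3, 4):
    need = cores_needed(n)
    # Report whether the demand fits today's cluster and the expanded one.
    print(f"{n} pipelines: need {need} cores; "
          f"fits now: {need <= CURRENT_CORES}; "
          f"fits with extra nodes: {need <= expanded}")
```

Under these assumptions the upper end of the estimate does not fit in the current 640 cores, which is consistent with the suggestion to start with an extra 400 and adjust later.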
|11:30||OPS planning and milestones|
Link to Data Facilities google doc for epics planning (and used in the meeting for adding more milestones)
Need to identify milestones that interact with CET, etc.
Epics are activities and milestones are points in time.
Only two milestones in DP right now:
Appropriate to have milestone for "Live image data+metadata from ComCam exposed to RSP so that it is available via ObsTAP" (work is in Middleware). Can also add "DRP works at USDF" (blocks "Ready for ComCam"), "Alert Distribution works at USDF" (not a blocker for anything, equivalent to unclaimed Construction milestone).
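The "blocks" relationships mentioned above could be tracked with a very small dependency map. This is a minimal sketch, not how Jira or P6 models it; the function names and the dictionary layout are hypothetical, and only the milestone names come from the notes.

```python
# Hypothetical model: each milestone maps to the milestones it blocks.
blocks = {
    "DRP works at USDF": ["Ready for ComCam"],
    "Alert Distribution works at USDF": [],  # not a blocker for anything
}


def blockers_of(milestone, blocks):
    """Return the milestones that must be done before `milestone`."""
    return sorted(src for src, targets in blocks.items() if milestone in targets)


def is_unblocked(milestone, blocks, done):
    """True when every blocker of `milestone` is in the `done` set."""
    return all(b in done for b in blockers_of(milestone, blocks))


print(blockers_of("Ready for ComCam", blocks))
print(is_unblocked("Ready for ComCam", blocks, done={"DRP works at USDF"}))
```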
A number of other USDF-based but cross-team milestones from Richard.
Distinguish between Ops work to operate systems and Ops work to build things not in Construction requirements; may affect which milestones to tie things to and whether to block milestones.
Figuring out Ops processes (and documenting them) may be considered an Ops activity; should have a milestone for defining these for ComCam Operations.
Secure servers needed before end of 2023.
All USDF work is considered Ops.
Need to have a mechanism for caching of images at Cloud DF, but not an issue until DP2, and this is an optimization anyway.
|12:15||Meeting free weeks 2023|
Agree dates for meeting free weeks 2023
Should not overlap meeting-free weeks with reviews, as those people need the weeks the most.
Perhaps alternate September week as Chilean holiday week (this year) with US Labor Day (2023-09-04).
April week the week of (2023-04-10).
June week the week of Juneteenth (2023-06-19).
Moderator: Leanne Guy
Notetaker: Wil O'Mullane
|Prompt Processing migration to the USDF||slides|
Forging ahead at USDF - no longer supporting GCP.
Tech selected - end-to-end demo not yet demonstrated - ETA one month
AuxTel perhaps end of year.
Need developer system, CI and Deployment + configs for test/int/production
RHL - what area are we prototyping so we can make templates? Ian spoke to Eric Deneihy about focusing on a specific area.
Colin - Knative, what is it? Equivalent of Google Cloud Run. Feeds jobs to workers, can spawn more and shut down as needed. Each worker retrieves calibrations and templates while waiting. We may want more control over which requests go to which worker and when they start up/stop. It's Kubernetes pods with our stack. Knative can do some translation between Kafka and webhooks: our code gets executed with the context of the next visit event, it then waits for image arrival, and after that it is all our code.
Frossie - where is the axis of parallelisation? It's per detector - it's 189 individual workers (more actually, 378, since one will not finish when the next starts). Are they persistent? Not sure Knative can do that - on cloud they were. Cold start is not terrible - next image message is well in advance of image. (Jira: https://jira.lsstcorp.org/issues/?jql=labels%20%3D%20Prompt-processing)
This is not condor/slurm/parsl since we need to config for next visit - proposal to replace a lot of the back end with a batch job under panda instead of a worker (technically a pilot job).
Retries at Knative level - or we drop image and redo in morning (since 1 minute window).
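The per-detector worker flow described in this discussion (receive next-visit event, preload calibrations and templates, wait for the image, then process or drop) can be sketched as below. This is an illustration only, not the actual Prompt Processing code: `handle_next_visit`, its parameters, and the use of `threading.Event` to stand in for the image-arrival notification are all assumptions.

```python
import threading

N_DETECTORS = 189
OVERLAPPING_VISITS = 2  # ASSUMPTION: one visit may still be running when the next starts
N_WORKERS = N_DETECTORS * OVERLAPPING_VISITS  # 378, as quoted in the notes


def handle_next_visit(visit_id, image_arrived, timeout_s, preload, process):
    """Hypothetical worker body for one detector and one next-visit event."""
    preload(visit_id)  # fetch calibrations/templates while waiting for the image
    if not image_arrived.wait(timeout=timeout_s):
        return "dropped"  # image missed the window; redo later (e.g. next morning)
    process(visit_id)  # same pipeline as elsewhere, run on the arrived image
    return "processed"


log = []
arrived = threading.Event()
arrived.set()  # simulate the object store announcing the image
status = handle_next_visit(
    1, arrived, 60.0,
    preload=lambda v: log.append(("preload", v)),
    process=lambda v: log.append(("process", v)),
)
print(N_WORKERS, status, log)
```

Whether a missed image is retried at the Knative level or dropped and reprocessed later is exactly the open choice noted above; the sketch shows only the "drop" branch.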
Richard - OGA rack: only Ceph object store and Butler repo, APDB, long-term storage; all compute outside.
Steve - all 189 Knative workers at the same time? Should await webhook - but we would pre-start them. Object store sends message to activator code, which runs ingest image. Copy image to local store.
Colin - prep butler etc. special - is the pipeline a pipetask? It's SimplePipelineExecutor, exactly the same pipeline.yaml as elsewhere - does not have to be AP, could be anything.
Leanne - how much is at USDF? Have Kafka, Knative started, Ceph object store - some messaging missing and our pod not yet started in Knative. Simulation writing to store and next image etc. need to be done. Almost ready to run when Knative pod can start - have an APDB set up.
Ian - dev/commissioning: right now we have 1 prompt processing system on a special server. Will we need a distinct dev system to avoid messing up? Do we need one per camera?
KT: Some parts may be shareable; will need dev vs prod endpoints. Need endpoints per instrument.
Ian - what we have now is dev and we will need a frozen version for AuxTel. Could deploy multiple endpoints for Knative in a single K8s cluster.
Colin - please have Dev/Int/Prod clusters. Need to deploy at USDF, TTS and Summit.
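To get a feel for how many endpoints "per instrument, per environment, per site" implies, the combinations can simply be enumerated. This is a sketch under stated assumptions: the naming scheme in `endpoint_name` is invented for illustration, the instrument list is an illustrative subset, and nothing in the notes says every combination would actually be deployed.

```python
from itertools import product

SITES = ("usdf", "tts", "summit")
ENVIRONMENTS = ("dev", "int", "prod")
INSTRUMENTS = ("auxtel", "comcam")  # ASSUMPTION: illustrative subset of cameras


def endpoint_name(site, env, instrument):
    """Hypothetical naming scheme, for illustration only."""
    return f"prompt-{instrument}-{env}.{site}"


def all_endpoints():
    """Enumerate every site/environment/instrument combination."""
    return [endpoint_name(s, e, i)
            for s, e, i in product(SITES, ENVIRONMENTS, INSTRUMENTS)]


endpoints = all_endpoints()
print(len(endpoints))
print(endpoints[:3])
```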
Frossie asks: is it only for prompt? KT: it could be used for rapid analysis.
Frossie: TTS is already busy and we can stomp on each other easily - may need more disk etc. to deploy this there.
- NB: IVOA meeting "DAL" session, mainly on ADQL, is 13:30-14:30 PT on Tuesday 2022-10-18, partially overlapping the planned "Prompt Processing migration to the USDF" session.
Day 2, Wednesday
Moderator: Wil O'Mullane
Notetaker: Yusra AlSayyad
Status and plans:
>>> presentors = ['Leanne', 'Cristian', 'Frossie', 'KT', 'Fritz', 'Stephen', 'Ian', 'Yusra']
['Fritz', 'Frossie', 'Yusra', 'Cristian', 'KT', 'Ian', 'Stephen', 'Leanne']
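The second list above is a reordering of the first, so the speaking order was presumably randomized. A minimal sketch of one way to do that follows; the use of `random.shuffle` and the helper `speaking_order` are assumptions, since the notes do not record the call that was actually run.

```python
import random

presentors = ['Leanne', 'Cristian', 'Frossie', 'KT', 'Fritz', 'Stephen', 'Ian', 'Yusra']


def speaking_order(names, seed=None):
    """Return a shuffled copy of names; pass a seed for a repeatable order."""
    rng = random.Random(seed)
    order = list(names)  # copy so the original list is left untouched
    rng.shuffle(order)
    return order


print(speaking_order(presentors))
```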
Wil: And Andres is going to give a talk in Spanish for us next week.
Wil: Cristian, we should talk with Kevin and give him a delivery time so that he can put in P6 a date that it will be in place. Then we can communicate with NSF about something happening. We have a milestone but need an activity.
|Moderator: Wil O'Mullane||Notetaker: Ian Sullivan|
|11:00||Needs for "tactical" databases||Robert Lupton||See RHL's rough notes|
|11:30 (DMLT only)||Wrap Up/Actions/AOB/|
|Topic||Requested by||Time required (estimate)||Notes|
|Prompt Processing migration to the USDF||45 minutes?||Prefer scheduling after lunch Tuesday|
|Meeting free weeks 2023||15m|
Agree dates for meeting free weeks 2023
|User Generated Data Products||30m|
|Commissioning Cluster, Yagan, and do we have what we need?||30m or less|
|Processing frameworks at USDF||30m|
Re: developer-driven processing. Closed-session please.
CTS: I think this is the same as what Tim is proposing below
|OPS planning and milestones||@womullan||1h||See the epics we have and what milestones make sense - block milestones with epics.|
|BPS plugin support||Decide whether we have an opinion on summit batch and developer batch (parsl vs condor vs panda). We are happy to support 2 bps plugins.|
Action Item Summary
|Description||Due date||Assignee||Task appears on|
| ||15 Mar 2022||Frossie Economou||DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17|
| ||18 Mar 2022||Kian-Tat Lim||DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17|
| ||15 Nov 2022||Gregory Dubois-Felsmann||DMLT meeting-2022-10-24|
| ||24 Apr 2023||Gregory Dubois-Felsmann||DMLT meeting-2023-04-17|
| ||04 Sep 2023||Leanne Guy||DMLT meeting-2023-08-21|
| || || ||DMLT meeting-2023-09-11|
| || ||Wil O'Mullane||DMLT meeting-2023-09-11|