Logistics

Date 

2022 October 18 and 19

Location

This meeting will be virtual.

Join Zoom Meeting (DMLT link)

Meeting: https://noirlab-edu.zoom.us/j/93412065536?pwd=QWhvb29kbVI5NEZNWFMrR0dzN0RBZz09

Meeting ID: 934 1206 5536

Password: 752226


Attendees:


All technical sessions of the DMLT vF2F will be open to DM members. Should a closed session be needed, it will be marked "DMLT only".

Day 1, Tuesday

Time | (Project) Topic | Coordinator | Pre-meeting notes | Running notes

Moderator: Wil O'Mullane 

Notetaker: Frossie Economou 

09:00 Welcome - Project news and updates

Project news etc
----------------

- reminder to onboard new hires
- Jira on cloud
  - needs coordination to ensure everybody's plugins work [incl. SysEng]

  • Wil O'Mullane to check with Ian for timelines on Jira Cloud

- Future DMLTs
  - spring DMLT at the La Serena JTM
  - avoid new moon for future meetings [rhl]
  - June 12-14
  - Oct 23-25
  
- Chile JTM
  - reminder that people need to arrive early for summit trips

09:30 (DMLT only) BPS plugin support / Processing at USDF/Summit

Decide whether we have an opinion on summit batch and developer-driven processing (parsl vs condor vs panda). We are happy to support two BPS plugins (probably not three).


← See pre-meeting notes uploaded by Richard

  • Do we need OCPS at all? [timj]
  • Wil O'Mullane to organize a discussion with rhl, timj, gpdf, ktl, etc. re: OCPS

- do we want to support these plugins outside USDF/summit? [jb]


  • Colin Slater to write a spec for the bake-off between HTCondor and parsl-over-Slurm, including who will execute the bake-off and how (see the config sketch at the end of this session's notes)

- discussion of the support model, esp. re: the summit [CS/FE]; support for the RSP model - the clock is ticking

  • priority on getting the HTCondor BPS plugin working with the new USDF HTCondor service - Tim Jenness
  • After HTCondor, get parsl/Slurm updated before the bake-off (bps report, memory multiplier, etc.) - Tim Jenness

  • Wil O'Mullane Report back on summit support model  

- after the bake-off, the winner goes on the summit
- team to deploy it on CS's infrastructure (TBD)

- parsl over k8s?
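
For the bake-off action above, a minimal sketch of how matched submissions could be set up: two bps submit configs that differ only in wmsServiceClass, so wall-clock time and "bps report" output can be compared like-for-like. The plugin class paths follow the ctrl_bps_htcondor / ctrl_bps_parsl naming convention; the pipeline file, repo, collection, and data query are hypothetical placeholders.

    # Hedged sketch: emit two otherwise-identical bps submit configs for the
    # HTCondor vs parsl-over-Slurm bake-off.
    import yaml

    base = {
        "pipelineYaml": "pipelines/bake_off.yaml",  # hypothetical pipeline file
        "payload": {
            "payloadName": "bps_bake_off",
            "butlerConfig": "/repo/main",           # hypothetical repo
            "inCollection": "HSC/defaults",         # hypothetical input collection
            "dataQuery": "tract = 9813",            # hypothetical data query
        },
    }

    plugins = {
        "htcondor": "lsst.ctrl.bps.htcondor.HTCondorService",
        "parsl": "lsst.ctrl.bps.parsl.ParslService",
    }

    for name, service_class in plugins.items():
        config = dict(base, wmsServiceClass=service_class)
        with open(f"bake_off_{name}.yaml", "w") as f:
            yaml.safe_dump(config, f)

    # Then run "bps submit bake_off_htcondor.yaml" and
    # "bps submit bake_off_parsl.yaml" against the same allocation and compare.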

10:00 User Generated Data Catalogs. Report on a general design and the pieces required to support user-generated data products in the RSP, so work planning can begin.

User-Generated Data Products.pdf

User-generated data products
----------------------------

- for temporary tables, the user only interacts with the TAP server [gpdf]
- temporary tables are ~10K rows [gpdf]
- temporary table plan is a go (see the sketch after this list)
- persistent tables
  - SkyServer approach: a schema per user [wom]
  - will not let the user specify schema order [gpdf]
  - Fritz Mueller and Gregory Dubois-Felsmann will outline an architecture in a technote
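
As context for the temporary-table plan, a minimal sketch of the intended user-facing flow, in which the user never touches the database directly and only talks to the TAP server. The endpoint URL is a hypothetical placeholder and the DP0.2 table is used only as a plausible example; pyvo's standard TAP_UPLOAD mechanism is assumed to be what gets enabled.

    # Hedged sketch: upload a small (~10K row) user table inline and join it
    # against a served catalog, all through the TAP server.
    import pyvo
    from astropy.table import Table

    my_table = Table({"ra": [150.11, 150.25], "dec": [2.20, 2.35]})

    tap = pyvo.dal.TAPService("https://data.lsst.cloud/api/tap")  # hypothetical endpoint
    result = tap.run_sync(
        """
        SELECT o.objectId, u.ra, u.dec
        FROM dp02_dc2_catalogs.Object AS o, TAP_UPLOAD.mine AS u
        WHERE CONTAINS(POINT('ICRS', o.coord_ra, o.coord_dec),
                       CIRCLE('ICRS', u.ra, u.dec, 1.0/3600.0)) = 1
        """,
        uploads={"mine": my_table},
    )
    print(result.to_table())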

10:30 Break

Moderator: Frossie Economou 

Notetaker: Kian-Tat Lim 

11:00 Commissioning Cluster, Yagan, and do we have what we need? Primary objective: Do we need to buy more nodes to increase the cluster's cores and RAM? Secondary objective: What do we need to run in Yagan?

Cristian Silva - DMLT vF2F October 2022.pdf

RHL: Initial spec was 2 cores per CCD (400 cores).

Right now not much being used out of 640 cores.

New nodes will arrive 7 months after order.

What is running where now?

  • RSP, Sasquatch, and RubinTV run in yagan
  • Control component CSCs run in yagan; Wil thinks this may be ~50 cores
  • chonchon runs LFA
  • amor runs LOVE
  • Cristián Silva Get contributions to ITTN-014 to record everything running in yagan  

Most heavy compute like full focal-plane wavefront will be done at USDF, but rapid analysis will be done in this cluster.

Power, cooling, and space concerns? Far from the space limit; close to the limit on the number of power sockets, but a new UPS will be installed soon, so there will be the ability to support more; the limit is at least 1500 cores.

Anything that happens at the Summit must happen at USDF (or be transported there) as well in order to satisfy users who will want the same data products.

Can we estimate based on the number of parallel pipelines running in order to provide feedback to operators in near-realtime? Maybe 2-4 such (therefore 400-800 cores; see the arithmetic below). FAFF is defining these; they may be different from what is in Alert Production (e.g., PSF estimation).
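
For reference, the arithmetic behind the 400-800 figure, assuming roughly one core per science CCD per concurrent rapid-analysis pipeline (an assumption, not a measured number):

    # Back-of-envelope sizing; one core per CCD per pipeline is an assumption.
    N_SCIENCE_CCDS = 189
    for n_pipelines in (2, 3, 4):
        print(n_pipelines, "pipelines ->", n_pipelines * N_SCIENCE_CCDS, "cores")
    # 2 -> 378, 3 -> 567, 4 -> 756: roughly the 400-800 core range quoted above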

Must distinguish between metrics that "must be there" (or will waste telescope time) and "mostly needs to be there" (can come from Alert Production).

USDF resources are also not particularly well-motivated.

Jira is also an issue for Summit independence.

Maybe start with an extra 400 to be safe, then adjust when we see what happens when ComCam goes on sky.

11:30 OPS planning and milestones


Link to the Data Facilities Google doc for epics planning (and used in the meeting for adding more milestones)

Need to identify milestones that interact with CET, etc.

Epics are activities and milestones are points in time.

Only two milestones in DP right now:

  • Ready for ComCam processing
  • Ready for DP1 

Appropriate to have milestone for "Live image data+metadata from ComCam exposed to RSP so that it is available via ObsTAP" (work is in Middleware).  Can also add "DRP works at USDF" (blocks "Ready for ComCam"), "Alert Distribution works at USDF" (not a blocker for anything, equivalent to unclaimed Construction milestone).
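
For concreteness, "available via ObsTAP" would mean a discovery query like the following works from the RSP. ivoa.ObsCore and its columns are IVOA standards; the service URL and obs_collection value here are hypothetical placeholders.

    # Hedged sketch: discover live ComCam images through ObsTAP.
    import pyvo

    tap = pyvo.dal.TAPService("https://usdf-rsp.slac.stanford.edu/api/tap")  # hypothetical
    results = tap.search(
        "SELECT TOP 10 obs_id, dataproduct_type, t_min, t_max, access_url "
        "FROM ivoa.ObsCore "
        "WHERE obs_collection = 'LSSTComCam' AND dataproduct_type = 'image'"
    )
    print(results.to_table())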

A number of other USDF-based but cross-team milestones from Richard.

Distinguish between Ops work to operate systems and Ops work to build things not in Construction requirements; may affect which milestones to tie things to and whether to block milestones.

Science Pipelines Ops work

  • Porting to USDF
  • Computational performance testing
  • Analysis tools and RubinTV are "missing scope"
  • Maintenance and evolution of delivered products (fgcm, piff)

Figuring out Ops processes (and documenting them) may be considered an Ops activity; should have a milestone for defining these for ComCam Operations.

Secure servers needed before end of 2023.

All USDF work is considered Ops.

Need to have a mechanism for caching of images at Cloud DF, but not an issue until DP2, and this is an optimization anyway.

  • Frossie Economou Should have a test of how image access will work with Cloud DF using images from USDF, showing that Cloud DF is not significantly worse than going directly to USDF would be  
  • Wil O'Mullane to complete milestones in Jira  
12:15 Meeting free weeks 2023

Agree dates for meeting free weeks 2023

  • Spring: JTM is March 14-16, so March 20; or April 3 or April 10 (around Easter)
  • Summer:  June 12 - 16
  • Autumn:  Sept 25 - 29
  • Winter: Dec 25 - Jan 5

Should not overlap meeting-free weeks with reviews, as those people need the weeks the most.

Perhaps alternate the September week between the Chilean holiday week (as this year) and US Labor Day (2023-09-04).

April week: the week of 2023-04-10.

June week: the week of Juneteenth (2023-06-19).

  • Wil O'Mullane Update dev guide to put in 2024 meeting-free dates
12:30 Break

Moderator: Leanne Guy 

Notetaker: Wil O'Mullane 

1:00

Prompt Processing migration to the USDF (slides)

Forging ahead at USDF - no longer supporting GCP.

Tech selected - end to end demo not yet demostrated - ETA one month

AuxTel perhaps end of year.

Need a developer system, CI, and deployment, plus configs for test/int/production

RHL - what area are we prototyping so we can make templates? Ian spoke to Eric Deneihy about focusing on a specific area.

Colin - Knative: what is it? The equivalent of Google Cloud Run. It feeds jobs to workers and can spawn more and shut them down as needed. Each worker retrieves calibrations and templates while waiting. We may want more control over which requests go to which worker and when they start up/stop. It's Kubernetes pods with our stack. Knative can do some translation between Kafka and webhooks: our code gets executed with the context of the next_visit event, it then waits for image arrival, and after that it is all our code.

Frossie - where is the axis of parallelisation? (Prompt-processing tickets: https://jira.lsstcorp.org/issues/?jql=labels%20%3D%20Prompt-processing) It's per detector - 189 individual workers (more actually, since one will not finish when the next starts: 378). Are they persistent? Not sure Knative can do that - on the cloud they were. Cold start is not terrible - the next-image message arrives well in advance of the image.

This is not condor/slurm/parsl, since we need to configure for the next visit - there is a proposal to replace a lot of the back end with a batch job under PanDA instead of a worker (technically a pilot job).

Retries happen at the Knative level - or we drop the image and redo it in the morning (since there is a 1-minute window).

Richard: the OGA rack holds only the Ceph object store, Butler repo, and APDB; long-term storage and all compute are outside.

Steve - all 189 Knative workers at the same time? They should await the webhook, but we would pre-start them. The object store sends a message to the activator code, which ingests the image. Copy the image to the local store.

Colin - is the butler prep etc. special? Is the pipeline a pipetask? It's SimplePipelineExecutor, with exactly the same pipeline.yaml as elsewhere - it does not have to be AP, it could be anything.
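
A rough sketch of the per-detector activator as described: Knative delivers the next_visit event as an HTTP POST, the worker waits for the raw to land, then runs the same pipeline YAML as anywhere else via SimplePipelineExecutor (the one piece named in the notes). The repo path, collections, pipeline file, event fields, and the wait_for_raw helper are all hypothetical placeholders.

    # Hedged sketch of a prompt-processing activator worker.
    from flask import Flask, request
    from lsst.pipe.base import SimplePipelineExecutor

    app = Flask(__name__)

    def wait_for_raw(visit):
        """Hypothetical helper: block until the raw for this visit is ingested."""
        ...

    @app.route("/", methods=["POST"])
    def handle_next_visit():
        visit = request.get_json()  # next_visit event payload from Knative
        wait_for_raw(visit)

        butler = SimplePipelineExecutor.prep_butler(
            "/repo/embargo",                 # hypothetical repo
            inputs=["LATISS/defaults"],      # hypothetical input collections
            output="u/prompt/output",        # hypothetical output collection
        )
        executor = SimplePipelineExecutor.from_pipeline_filename(
            "prompt.yaml",                   # same pipeline.yaml as elsewhere
            where=f"exposure = {visit['exposure']} AND detector = {visit['detector']}",
            butler=butler,
        )
        executor.run()
        return "", 200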


Leanne - how much is at USDF? We have Kafka, Knative started, and the Ceph object store - some messaging is missing and our pod is not yet started in Knative. Simulating writes to the store, the next-image event, etc. still need to be done. Almost ready to run once the Knative pod can start - an APDB is set up.


Ian - dev/commissioning: right now we have one prompt processing system on a special server. Will we need a distinct dev system to avoid messing things up? Do we need one per camera?

KT: Some parts may be shareable; we will need dev vs. prod endpoints. Need endpoints per instrument.

Ian: what we have now is dev, and we will need a frozen version for AuxTel. Could deploy multiple endpoints for Knative in a single K8s cluster.

Colin - please have Dev/Int/Prod clusters. Need to deploy at USDF, TTS, and the summit.

Frossie asks: is it prompt? KT: it could be used for rapid analysis.

Frossie: TTS is already busy and we can easily stomp on each other - may need more disk etc. to deploy this there.


1:45 Close



Day 2, Wednesday

Moderator: Wil O'Mullane 

 Notetaker: Yusra AlSayyad 

9:00

Status and plans:

DAX

SQuARE

DRP

IT DevOPs

Arch

AP

NCSA

DM Science

>>> import random

>>> presenters = ['Leanne', 'Cristian', 'Frossie', 'KT', 'Fritz', 'Stephen', 'Ian', 'Yusra']

>>> random.shuffle(presenters)

>>> presenters.remove('Fritz')
>>> presenters.append('Fritz')

(no?)

['Fritz', 'Frossie', 'Yusra', 'Cristian', 'KT', 'Ian', 'Stephen', 'Leanne']

DAX F22/S23 Status_Plans.pdf

Cristian Silva - DMLT vF2F October 2022.pdf

Arch F23A Status and Plans.pdf

DM Science Retrospective and Plans F22AB

NCSA DMLT F2F Oct 2022.pdf

  • K-T: Is it possible to pay CADC?
  • Frossie: No, but it is possible to trade some of our systems with some other systems. 


  • YA: We've been announcing it since March, but in case you didn't get the memo: with Scarlet Lite's computational performance improvements, we're not worried about having the resources to run it.
  • Wil: Does this mean we can retire that risk?
  • YA: No, because that risk was about ALL algorithms, and just because we’re not worried about Scarlet anymore doesn’t mean that we’re not worried about things like galaxy measurement. We still have more algos than we have compute for, and there will be hard tradeoffs to be made. 

Wil: And Andres is going to give a talk in Spanish for us next week. 

  • Colin: The backup link, that’s microwave from the summit to where?
  • Cristian: Just summit to base. It’ll be invisible. When the fiber goes down, we change the priority to wireless. It’s only for control data, not science data. 


  • Wil: We had a network meeting with SLAC last week. Are they OK with the routers?
  • Cristian: Yep. We need a list of what needs to be reachable from Rubin.
  • Wil: The question from (didn't catch the name?) about the suitability of the routers for installation?
  • Cristian: Oh, we didn't get to that. We gave him the specs. It's a router behind our routers at the summit. Another at SLAC: one for the 100 Gb/s, one for the 40 Gb/s.
    • Richard Dubois Confirm on your side that SLAC is OK with the router specs?  


Wil: Cristian, we should talk with Kevin and give him a delivery time so that he can put in P6 a date that it will be in place. Then we can communicate with NSF about something happening. We have a milestone but need an activity. 


  • Wil: Good about the Postgres. Tiago was showing me this serverless Postgres he set up, and I told him that you have something like that with Kubernetes that he can use.
  • Richard: Do you need help from the Fermilab DBA folks to set up the Postgres on Kubernetes?
  • Cristian: Sure! The servers are already running. Who do I write to?
  • Richard: <notetaker didn't catch the name>.
  • Wil: We’ll want the help once everyone is adding their schemas.
  • K-T: to the extent that they are using the Butler, it's just one schema. I think it's up to Steve and me to migrate the SQLite DB for the OODS over to Postgres. Got bogged down.
  • Steve: Yeah, I’m in contact with Bruno about that. 


  • Frossie: K-T, What was the metrics thing?
  • K-T: Thinking about how to get metrics into Sasquatch. To my knowledge, the way it's done now is outside of the pipeline (see the sketch below).
  • Frossie: Yeah, but Angelo is working on it now. 
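
A hedged sketch of what in-pipeline publication to Sasquatch could look like, given that Sasquatch's backend is Kafka feeding InfluxDB. The broker address, topic name, and record schema below are hypothetical; the real ingest interface may differ in detail.

    # Hedged sketch: publish one metric value to a Sasquatch Kafka topic.
    import json
    import time

    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "sasquatch-kafka:9092"})  # hypothetical broker
    record = {
        "timestamp": time.time(),
        "metric": "pipe_analysis.psf_fwhm_median",  # hypothetical metric name
        "value": 0.72,
        "units": "arcsec",
    }
    producer.produce("lsst.dm.metrics", json.dumps(record).encode())  # hypothetical topic
    producer.flush()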


  • Richard: Jenkins workers running in the cloud… what’s the need to move it to SLAC?
  • K-T: no requirement, but it would allow us to do things Merlin asked for. He wants to run pipeline tests against live butler repos. We never had it at NCSA, even though we could have.
  • Frossie: Those jobs were run at the DF so that we could run deep tests with the production system.
  • K-T: depends on how much we want to own everything
  • Cristian: Is LFA replication bi-directional?
  • K-T: No
  • Cristian: we could use our multi-site replication!
  • Wil: Let’s punt this to a future discussion. 
      •  Kian-Tat Lim  Convene a meeting following up vF2F discussion about what to do with Jenkins  
10:30 Break

Moderator: Wil O'Mullane

Notetaker: Ian Sullivan


11:00 Needs for "tactical" databases - Robert Lupton. See RHL's rough notes.
  • FE: There are many producers and many consumers of metrics. Each has different needs
    • Elephant in the room is the consolidated database. Metrics were going to be one of the things persisted with the Butler
    • Stuff that needs to go from USDF to the summit needs to go through Sasquatch
    • Stuff that needs to be presented to the user needs to be in the consolidated database
  • RHL: thinks there could be other ways of moving stuff from USDF to the summit
    • this is a proposal for what we need, which might be what the consolidated database may be
  • FE: We have tooling other than Chronograf on the summit; we also have notebooks (see the notebook sketch after this list)
  • KTL: Rucio is another method to transfer (calibration) datasets from the USDF to the summit
    • Any time series-only data should go in the database
  • JB: interested in looking into the database keys that are explicitly time-oriented.
    • WOM: should follow up outside this meeting
  • FE: Concern about camera data that is not in DM-land.
  • RHL: Interested in implementation here. The camera team has tools, how do we use them?
    • WOM: We can port the tools, and make the queries talk to influx DB.
  • FE: Kafka plus InfluxDB is Sasquatch. Is it true that camera telemetry scalars are going into the EFD?
    • RHL: Almost
    • WOM: will be down there soon, and can assist with the port if needed.
  • KTL: This mostly deals with the keys, not the values. Those will change over time, and will require talking with the scientists
  • FE: I don’t see how this schema gets defined without GPDF. We need data engineers
  • GPDF: I don't need to be involved in the definition of each table, but yes to deciding what keys are time varying etc.
  • WOM: Once we have postgres at the summit, we should start putting schemas in it
  • JB: What kind of skills does the person setting up the schema at the summit need to have?
    • WOM: that is what I am trying to work out.
    • CS: The need is for someone to understand the conditions at the summit
    • WOM: That's Patrick, Tiago, Erik, Robert
      • Question is whether we can make use of general knowledge of people available at Fermilab
  • Kian-Tat Lim Determine whether people at Fermilab can assist with setting up schemas for the summit  
  • Robert Lupton Identify point of contact on commissioning side for summit schema  
  • Robert Lupton Kian-Tat Lim Kick off meeting between commissioning, Fermilab, and summit (Carlos) for summit schema  
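
As an illustration of the notebook tooling mentioned above, a minimal sketch of pulling summit time series from the EFD in a notebook. The EfdClient API is real (lsst-efd-client); the instance alias, topic, and field names are assumptions.

    # Hedged sketch: read an hour of mount telemetry from the EFD.
    import asyncio

    from astropy.time import Time, TimeDelta
    from lsst_efd_client import EfdClient

    async def main():
        client = EfdClient("summit_efd")      # assumed instance alias
        end = Time.now()
        start = end - TimeDelta(3600, format="sec")
        df = await client.select_time_series(
            "lsst.sal.MTMount.azimuth",       # assumed topic
            ["actualPosition"],               # assumed field
            start, end,
        )
        print(df.tail())

    asyncio.run(main())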



11:30 (DMLT only)Wrap Up/Actions/AOB/
  • YA: Campaign management reorganization (slides)

      • LG: Why are Colin and Yusra interim product owner and group lead?
        • YA: if someone signs up for the campaign management team, I want to leave these roles open to them.
        • LG: might be good to write current, not interim
        • YA: hoping that someone will step up and want to do the group lead role. Agree current is a better term
      • KT: pilot/co-pilot rotation sounds like a great idea, but people are not necessarily replaceable in what they were working on before. Is it expected that people will pause other work, and it will be restarted when they return from campaign management?
      • YA: I don't want people to feel trapped in this role, which can easily happen.
      • RD: Gotta run. We discussed this at the retreat so I’m ok with it. Remember there can be members from Europe. We also worried that 100% would indeed burn people out.
      • EB: There may be distinctions between how this looks for DRP/AP/commissioning.
        • WOM: Frossie has some org charts, we will write it up with more detail
      • GPDF: very supportive. Looks a lot like the mechanisms we set up eventually with BABAR. 100% agree this tends to burn people out; there is a small subset of people who enjoy it.
        • agree with EB that there will be significant differences in rhythm between AP and DRP
      • LG: question for Colin:
        • do you see different people in V&V rotating through as well?
        • CS: This is not a position I intend to sit in just to hold it; I would be happy for someone else to step up as product owner, but it is too important to leave empty at first
        • LG: I think it would be very helpful to have V&V team members rotate through, to spread experience.
          • WOM: really like the idea of more people having experience organizing/leading. Builds resilience
      • FE (slides)
        • This plan that YA proposed came out of DPP retreat last month.
          • Campaign management now part of Data Production under YA, next to Algorithms&Pipelines
          • Qserv now in Data Services under FE
  • FE: format of this meeting
    • Propose that we upload status slides in advance, use the timeslot for questions and discussion (standup style)
    • FM: finds it useful to write/present a plan in response to the discussions from the DMLT; would find it less useful to prepare in advance
    • JB: slides written to be read are different than ones written for presentation, since that allows greater clarification
    • WOM: We used to reserve some time at the end for discussion
    • WOM: suggestion is to start last session later, so that people have time to finish their presentations. Provide drafts in advance. Provide time for reading the presentations, and schedule time (~1hr) for discussion afterwards.
12:30 Close



Proposed Topics

Topic | Requested by | Time required (estimate) | Notes

Prompt Processing migration to the USDF | | 45 minutes? | Prefer scheduling after lunch Tuesday

Meeting free weeks 2023 | | 15m | Agree dates for meeting free weeks 2023:
  • Spring: JTM is March 14-16, so March 20; or April 3 or April 10 (around Easter)
  • Summer: June 12 - 16
  • Autumn: Sept 25 - 29
  • Winter: Dec 25 - Jan 5

User Generated Data Products | | 30m |

Commissioning Cluster, Yagan, and do we have what we need? | | 30m or less |

Processing frameworks at USDF | | 30m | Re: developer-driven processing. Closed session, please. CTS: I think this is the same as what Tim is proposing below.

OPS planning and milestones | @womullan | 1h | See the epics we have and what milestones make sense - block milestones with epics.

BPS plugin support | | | Decide whether we have an opinion on summit batch and developer batch (parsl vs condor vs panda). We are happy to support two BPS plugins.


