Logistics

Date 

2022 October 18 and 19

Location

This meeting will be virtual.

Join Zoom Meeting (DMLT link)

Meeting: https://noirlab-edu.zoom.us/j/93412065536?pwd=QWhvb29kbVI5NEZNWFMrR0dzN0RBZz09

Meeting ID: 934 1206 5536

Password: 752226


Attendees:


All technical sessions of the DMLT vF2F will be open to DM members. Should a closed session be needed, it will be marked "DMLT only".

Day 1, Tuesday

Time | (Project) Topic | Coordinator | Pre-meeting notes | Running notes

Moderator: Wil O'Mullane 

Notetaker: Frossie Economou 

09:00 Welcome - Project news and updates

Project news etc
----------------

- reminder to onboard new hires
- Jira on cloud
  - needs coordination to ensure everybody's plugins work [incl. SysEng]

  • Wil O'Mullane to check with Ian for timelines on Jira Cloud

- Future DMLTs
  - spring DMLT at the La Serena JTM
  - avoid new moon for future meetings [rhl]
  - June 12-14
  - Oct 23-25
  
- Chile JTM
  - reminder that people need to arrive early for summit trips

09:30 (DMLT only) BPS plugin support / Processing at USDF/Summit

Decide whether we have an opinion on summit batch and developer-driven processing (parsl vs condor vs panda). We are happy to support two BPS plugins (probably not three).


← See pre-meeting notes uploaded by Richard

  • Do we need OCPS at all? [timj]
  • Wil O'Mullane to organize a discussion with rhl, timj, gpdf, ktl, etc. re: OCPS

- do we want to support these plugins outside USDF/summit? [jb]


  • Colin Slater to write a spec for the bake-off between HTCondor and parsl-over-Slurm, including who will execute the bake-off and how (see the config sketch at the end of this session's notes)

- discussion of the support model, esp. re: the summit [CS/FE]; support for the RSP model - the clock is ticking

  • priority on getting the HTCondor BPS plugin working with the new USDF HTCondor service - Tim Jenness
  • After HTCondor, get parsl/Slurm updated before the bake-off (bps report, memory multiplier, etc.) - Tim Jenness

  • Wil O'Mullane Report back on summit support model  

- after the bake-off, the winner goes on the summit
- team to deploy it on CS's infrastructure (TBD)

- parsl over k8s?
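
For the bake-off action above, a minimal sketch of how matched submissions could be set up: two bps submit configs that differ only in wmsServiceClass, so wall-clock time and "bps report" output can be compared like-for-like. The plugin class paths follow the ctrl_bps_htcondor / ctrl_bps_parsl naming convention; the pipeline file, repo, collection, and data query are hypothetical placeholders.

    # Hedged sketch: emit two otherwise-identical bps submit configs for the
    # HTCondor vs parsl-over-Slurm bake-off.
    import yaml

    base = {
        "pipelineYaml": "pipelines/bake_off.yaml",  # hypothetical pipeline file
        "payload": {
            "payloadName": "bps_bake_off",
            "butlerConfig": "/repo/main",           # hypothetical repo
            "inCollection": "HSC/defaults",         # hypothetical input collection
            "dataQuery": "tract = 9813",            # hypothetical data query
        },
    }

    plugins = {
        "htcondor": "lsst.ctrl.bps.htcondor.HTCondorService",
        "parsl": "lsst.ctrl.bps.parsl.ParslService",
    }

    for name, service_class in plugins.items():
        config = dict(base, wmsServiceClass=service_class)
        with open(f"bake_off_{name}.yaml", "w") as f:
            yaml.safe_dump(config, f)

    # Then run "bps submit bake_off_htcondor.yaml" and
    # "bps submit bake_off_parsl.yaml" against the same allocation and compare.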

10:00 User Generated Data Catalogs. Report on a general design and the pieces required to support user-generated data products in the RSP, so work planning can begin.

User-Generated Data Products.pdf

User-generated data products
----------------------------

- for temporary tables, the user only interacts with the TAP server [gpdf]
- temporary tables are ~10K rows [gpdf]
- temporary table plan is a go (see the sketch after this list)
- persistent tables
  - SkyServer approach: a schema per user [wom]
  - will not let the user specify schema order [gpdf]
  - Fritz Mueller and Gregory Dubois-Felsmann will outline an architecture in a technote
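
As context for the temporary-table plan, a minimal sketch of the intended user-facing flow, in which the user never touches the database directly and only talks to the TAP server. The endpoint URL is a hypothetical placeholder and the DP0.2 table is used only as a plausible example; pyvo's standard TAP_UPLOAD mechanism is assumed to be what gets enabled.

    # Hedged sketch: upload a small (~10K row) user table inline and join it
    # against a served catalog, all through the TAP server.
    import pyvo
    from astropy.table import Table

    my_table = Table({"ra": [150.11, 150.25], "dec": [2.20, 2.35]})

    tap = pyvo.dal.TAPService("https://data.lsst.cloud/api/tap")  # hypothetical endpoint
    result = tap.run_sync(
        """
        SELECT o.objectId, u.ra, u.dec
        FROM dp02_dc2_catalogs.Object AS o, TAP_UPLOAD.mine AS u
        WHERE CONTAINS(POINT('ICRS', o.coord_ra, o.coord_dec),
                       CIRCLE('ICRS', u.ra, u.dec, 1.0/3600.0)) = 1
        """,
        uploads={"mine": my_table},
    )
    print(result.to_table())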

10:30 Break

Moderator: Frossie Economou 

Notetaker: Kian-Tat Lim 

11:00 Commissioning Cluster, Yagan, and do we have what we need? Primary objective: Do we need to buy more nodes to increase the cluster's cores and RAM? Secondary objective: What do we need to run in Yagan?

Cristian Silva - DMLT vF2F October 2022.pdf

RHL: Initial spec was 2 cores per CCD (400 cores).

Right now not much being used out of 640 cores.

New nodes will arrive 7 months after order.

What is running where now?

  • RSP, Sasquatch, and RubinTV run in yagan
  • Control component CSCs run in yagan; Wil thinks this may be ~50 cores
  • chonchon runs LFA
  • amor runs LOVE
  • Cristián Silva Get contributions to ITTN-014 to record everything running in yagan  

Most heavy compute like full focal-plane wavefront will be done at USDF, but rapid analysis will be done in this cluster.

Power, cooling, and space concerns? Far from the space limit; close to the limit on the number of power sockets, but a new UPS will be installed soon, so there will be the ability to support more; the limit is at least 1500 cores.

Anything that happens at the Summit must happen at USDF (or be transported there) as well in order to satisfy users who will want the same data products.

Can we estimate based on the number of parallel pipelines running in order to provide feedback to operators in near-realtime? Maybe 2-4 such (therefore 400-800 cores; see the arithmetic below). FAFF is defining these; they may be different from what is in Alert Production (e.g., PSF estimation).
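
For reference, the arithmetic behind the 400-800 figure, assuming roughly one core per science CCD per concurrent rapid-analysis pipeline (an assumption, not a measured number):

    # Back-of-envelope sizing; one core per CCD per pipeline is an assumption.
    N_SCIENCE_CCDS = 189
    for n_pipelines in (2, 3, 4):
        print(n_pipelines, "pipelines ->", n_pipelines * N_SCIENCE_CCDS, "cores")
    # 2 -> 378, 3 -> 567, 4 -> 756: roughly the 400-800 core range quoted above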

Must distinguish between metrics that "must be there" (or will waste telescope time) and "mostly needs to be there" (can come from Alert Production).

USDF resources are also not particularly well-motivated.

Jira is also an issue for Summit independence.

Maybe start with an extra 400 to be safe, then adjust when we see what happens when ComCam goes on sky.

11:30 OPS planning and milestones


Link to the Data Facilities Google doc for epics planning (and used in the meeting for adding more milestones)

Need to identify milestones that interact with CET, etc.

Epics are activities and milestones are points in time.

Only two milestones in DP right now:

  • Ready for ComCam processing
  • Ready for DP1 

Appropriate to have milestone for "Live image data+metadata from ComCam exposed to RSP so that it is available via ObsTAP" (work is in Middleware).  Can also add "DRP works at USDF" (blocks "Ready for ComCam"), "Alert Distribution works at USDF" (not a blocker for anything, equivalent to unclaimed Construction milestone).
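
For concreteness, "available via ObsTAP" would mean a discovery query like the following works from the RSP. ivoa.ObsCore and its columns are IVOA standards; the service URL and obs_collection value here are hypothetical placeholders.

    # Hedged sketch: discover live ComCam images through ObsTAP.
    import pyvo

    tap = pyvo.dal.TAPService("https://usdf-rsp.slac.stanford.edu/api/tap")  # hypothetical
    results = tap.search(
        "SELECT TOP 10 obs_id, dataproduct_type, t_min, t_max, access_url "
        "FROM ivoa.ObsCore "
        "WHERE obs_collection = 'LSSTComCam' AND dataproduct_type = 'image'"
    )
    print(results.to_table())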

A number of other USDF-based but cross-team milestones from Richard.

Distinguish between Ops work to operate systems and Ops work to build things not in Construction requirements; may affect which milestones to tie things to and whether to block milestones.

Science Pipelines Ops work

  • Porting to USDF
  • Computational performance testing
  • Analysis tools and RubinTV are "missing scope"
  • Maintenance and evolution of delivered products (fgcm, piff)

Figuring out Ops processes (and documenting them) may be considered an Ops activity; should have a milestone for defining these for ComCam Operations.

Secure servers needed before end of 2023.

All USDF work is considered Ops.

Need to have a mechanism for caching of images at Cloud DF, but not an issue until DP2, and this is an optimization anyway.

  • Frossie Economou Should have a test of how image access will work with Cloud DF using images from USDF, showing that Cloud DF is not significantly worse than going directly to USDF would be  
  • Wil O'Mullane to complete milestones in Jira  
12:15 Meeting free weeks 2023

Agree dates for meeting free weeks 2023

  • Spring: JTM is March 14-16, so March 20; or April 3 or April 10 (around Easter)
  • Summer:  June 12 - 16
  • Autumn:  Sept 25 - 29
  • Winter: Dec 25 - Jan 5

Should not overlap meeting-free weeks with reviews, as those people need the weeks the most.

Perhaps alternate the September week between the Chilean holiday week (as this year) and US Labor Day (2023-09-04).

April week: the week of 2023-04-10.

June week: the week of Juneteenth (2023-06-19).

  • Wil O'Mullane Update dev guide to put in 2024 meeting-free dates
12:30 Break

Moderator: Leanne Guy 

Notetaker: Wil O'Mullane 

1:00

Prompt Processing migration to the USDF (slides)

Forging ahead at USDF - no longer supporting GCP.

Tech selected - end to end demo not yet demostrated - ETA one month

AuxTel perhaps end of year.

Need a developer system, CI, and deployment, plus configs for test/int/production

RHL - what area are we prototyping so we can make templates? Ian spoke to Eric Deneihy about focusing on a specific area.

Colin - Knative: what is it? The equivalent of Google Cloud Run. It feeds jobs to workers and can spawn more and shut them down as needed. Each worker retrieves calibrations and templates while waiting. We may want more control over which requests go to which worker and when they start up/stop. It's Kubernetes pods with our stack. Knative can do some translation between Kafka and webhooks: our code gets executed with the context of the next_visit event, it then waits for image arrival, and after that it is all our code.

Frossie - where is the axis of parallelisation? (Prompt-processing tickets: https://jira.lsstcorp.org/issues/?jql=labels%20%3D%20Prompt-processing) It's per detector - 189 individual workers (more actually, since one will not finish when the next starts: 378). Are they persistent? Not sure Knative can do that - on the cloud they were. Cold start is not terrible - the next-image message arrives well in advance of the image.

This is not condor/slurm/parsl, since we need to configure for the next visit - there is a proposal to replace a lot of the back end with a batch job under PanDA instead of a worker (technically a pilot job).

Retries happen at the Knative level - or we drop the image and redo it in the morning (since there is a 1-minute window).

Richard: the OGA rack holds only the Ceph object store, Butler repo, and APDB; long-term storage and all compute are outside.

Steve - all 189 Knative workers at the same time? They should await the webhook, but we would pre-start them. The object store sends a message to the activator code, which ingests the image. Copy the image to the local store.

Colin - is the butler prep etc. special? Is the pipeline a pipetask? It's SimplePipelineExecutor, with exactly the same pipeline.yaml as elsewhere - it does not have to be AP, it could be anything.
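
A rough sketch of the per-detector activator as described: Knative delivers the next_visit event as an HTTP POST, the worker waits for the raw to land, then runs the same pipeline YAML as anywhere else via SimplePipelineExecutor (the one piece named in the notes). The repo path, collections, pipeline file, event fields, and the wait_for_raw helper are all hypothetical placeholders.

    # Hedged sketch of a prompt-processing activator worker.
    from flask import Flask, request
    from lsst.pipe.base import SimplePipelineExecutor

    app = Flask(__name__)

    def wait_for_raw(visit):
        """Hypothetical helper: block until the raw for this visit is ingested."""
        ...

    @app.route("/", methods=["POST"])
    def handle_next_visit():
        visit = request.get_json()  # next_visit event payload from Knative
        wait_for_raw(visit)

        butler = SimplePipelineExecutor.prep_butler(
            "/repo/embargo",                 # hypothetical repo
            inputs=["LATISS/defaults"],      # hypothetical input collections
            output="u/prompt/output",        # hypothetical output collection
        )
        executor = SimplePipelineExecutor.from_pipeline_filename(
            "prompt.yaml",                   # same pipeline.yaml as elsewhere
            where=f"exposure = {visit['exposure']} AND detector = {visit['detector']}",
            butler=butler,
        )
        executor.run()
        return "", 200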


Leanne - how much is at USDF? We have Kafka, Knative started, and the Ceph object store - some messaging is missing and our pod is not yet started in Knative. Simulating writes to the store, the next-image event, etc. still need to be done. Almost ready to run once the Knative pod can start - an APDB is set up.


Ian - dev/commissioning: right now we have one prompt processing system on a special server. Will we need a distinct dev system to avoid messing things up? Do we need one per camera?

KT: Some parts may be shareable; we will need dev vs. prod endpoints. Need endpoints per instrument.

Ian: what we have now is dev, and we will need a frozen version for AuxTel. Could deploy multiple endpoints for Knative in a single K8s cluster.

Colin - please have Dev/Int/Prod clusters. Need to deploy at USDF, TTS, and the summit.

Frossie asks: is it prompt? KT: it could be used for rapid analysis.

Frossie: TTS is already busy and we can easily stomp on each other - may need more disk etc. to deploy this there.


1:45 Close



Day 2, Wednesday

Moderator: Wil O'Mullane 

 Notetaker: Yusra AlSayyad 

9:00

Status and plans:

DAX

SQuARE

DRP

IT DevOPs

Arch

AP

NCSA

DM Science

>>> import random

>>> presenters = ['Leanne', 'Cristian', 'Frossie', 'KT', 'Fritz', 'Stephen', 'Ian', 'Yusra']

>>> random.shuffle(presenters)

>>> presenters.remove('Fritz')
>>> presenters.append('Fritz')

(no?)

['Fritz', 'Frossie', 'Yusra', 'Cristian', 'KT', 'Ian', 'Stephen', 'Leanne']

DAX F22/S23 Status_Plans.pdf

Cristian Silva - DMLT vF2F October 2022.pdf

Arch F23A Status and Plans.pdf

DM Science Retrospective and Plans F22AB

NCSA DMLT F2F Oct 2022.pdf

  • K-T: Is it possible to pay CADC?
  • Frossie: No, but it is possible to trade some of our systems with some other systems. 


  • YA: We've been announcing it since March, but in case you didn't get the memo: with Scarlet Lite's computational performance improvements, we're not worried about having the resources to run it.
  • Wil: Does this mean we can retire that risk?
  • YA: No, because that risk was about ALL algorithms, and just because we’re not worried about Scarlet anymore doesn’t mean that we’re not worried about things like galaxy measurement. We still have more algos than we have compute for, and there will be hard tradeoffs to be made. 

Wil: And Andres is going to give a talk in Spanish for us next week. 

  • Colin: The backup link, that’s microwave from the summit to where?
  • Cristian: Just summit to base. It’ll be invisible. When the fiber goes down, we change the priority to wireless. It’s only for control data, not science data. 


  • Wil: We had a network meeting with SLAC last week. Are they OK with the routers?
  • Cristian: Yep. We need a list of what needs to be reachable from Rubin.
  • Wil: The question from (didn't catch the name?) about the suitability of the routers for installation?
  • Cristian: Oh, we didn't get to that. We gave him the specs. It's a router behind our routers at the summit. Another at SLAC: one for the 100 Gb/s, one for the 40 Gb/s.
    • Richard Dubois Confirm on your side that SLAC is OK with the router specs?  


Wil: Cristian, we should talk with Kevin and give him a delivery time so that he can put in P6 a date that it will be in place. Then we can communicate with NSF about something happening. We have a milestone but need an activity. 


  • Wil: Good about the Postgres. Tiago was showing me this serverless Postgres he set up, and I told him that you have something like that with Kubernetes that he can use.
  • Richard: Do you need help from the Fermilab DBA folks to set up the Postgres on Kubernetes?
  • Cristian: Sure! The servers are already running. Who do I write to?
  • Richard: <notetaker didn't catch the name>.
  • Wil: We’ll want the help once everyone is adding their schemas.
  • K-T: to the extent that they are using the Butler, it's just one schema. I think it's up to Steve and me to migrate the SQLite DB for the OODS over to Postgres. Got bogged down.
  • Steve: Yeah, I’m in contact with Bruno about that. 


  • Frossie: K-T, What was the metrics thing?
  • K-T: Thinking about how to get metrics into Sasquatch. To my knowledge, the way it's done now is outside of the pipeline (see the sketch below).
  • Frossie: Yeah, but Angelo is working on it now. 
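
A hedged sketch of what in-pipeline publication to Sasquatch could look like, given that Sasquatch's backend is Kafka feeding InfluxDB. The broker address, topic name, and record schema below are hypothetical; the real ingest interface may differ in detail.

    # Hedged sketch: publish one metric value to a Sasquatch Kafka topic.
    import json
    import time

    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "sasquatch-kafka:9092"})  # hypothetical broker
    record = {
        "timestamp": time.time(),
        "metric": "pipe_analysis.psf_fwhm_median",  # hypothetical metric name
        "value": 0.72,
        "units": "arcsec",
    }
    producer.produce("lsst.dm.metrics", json.dumps(record).encode())  # hypothetical topic
    producer.flush()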


  • Richard: Jenkins workers running in the cloud… what’s the need to move it to SLAC?
  • K-T: no requirement, but it would allow us to do things Merlin asked for. He wants to run pipeline tests against live butler repos. We never had it at NCSA, even though we could have.
  • Frossie: Those jobs were run at the DF so that we could run deep tests with the production system.
  • K-T: depends on how much we want to own everything
  • Cristian: Is LFA replication bi-directional?
  • K-T: No
  • Cristian: we could use our multi-site replication!
  • Wil: Let’s punt this to a future discussion. 
      •  Kian-Tat Lim  Convene a meeting following up vF2F discussion about what to do with Jenkins  
10:30 Break

Moderator: Wil O'Mullane

Notetaker: Ian Sullivan


11:00 Needs for "tactical" databases - Robert Lupton. See RHL's rough notes.
  • FE: There are many producers and many consumers of metrics. Each has different needs
    • Elephant in the room is the consolidated database. Metrics were going to be one of the things persisted with the Butler
    • Stuff that needs to go from USDF to the summit needs to go through Sasquatch
    • Stuff that needs to be presented to the user needs to be in the consolidated database
  • RHL: thinks there could be other ways of moving stuff from USDF to the summit
    • this is a proposal for what we need, which might be what the consolidated database may be
  • FE: We have tooling other than Chronograf on the summit; we also have notebooks (see the notebook sketch after this list)
  • KTL: Rucio is another method to transfer (calibration) datasets from the USDF to the summit
    • Any time series-only data should go in the database
  • JB: interested in looking into the database keys that are explicitly time-oriented.
    • WOM: should follow up outside this meeting
  • FE: Concern about camera data that is not in DM-land.
  • RHL: Interested in implementation here. The camera team has tools, how do we use them?
    • WOM: We can port the tools, and make the queries talk to influx DB.
  • FE: Kafka plus InfluxDB is Sasquatch. Is it true that camera telemetry scalars are going into the EFD?
    • RHL: Almost
    • WOM: will be down there soon, and can assist with the port if needed.
  • KTL: This mostly deals with the keys, not the values. Those will change over time, and will require talking with the scientists
  • FE: I don’t see how this schema gets defined without GPDF. We need data engineers
  • GPDF: I don't need to be involved in the definition of each table, but yes to deciding what keys are time varying etc.
  • WOM: Once we have postgres at the summit, we should start putting schemas in it
  • JB: What kind of skills does the person setting up the schema at the summit need to have?
    • WOM: that is what I am trying to work out.
    • CS: The need is for someone to understand the conditions at the summit
    • WOM: That's Patrick, Tiago, Erik, Robert
      • Question is whether we can make use of general knowledge of people available at Fermilab
  • Kian-Tat Lim Determine whether people at Fermilab can assist with setting up schemas for the summit  
  • Robert Lupton Identify point of contact on commissioning side for summit schema  
  • Robert Lupton Kian-Tat Lim Kick off meeting between commissioning, Fermilab, and summit (Carlos) for summit schema  
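
As an illustration of the notebook tooling mentioned above, a minimal sketch of pulling summit time series from the EFD in a notebook. The EfdClient API is real (lsst-efd-client); the instance alias, topic, and field names are assumptions.

    # Hedged sketch: read an hour of mount telemetry from the EFD.
    import asyncio

    from astropy.time import Time, TimeDelta
    from lsst_efd_client import EfdClient

    async def main():
        client = EfdClient("summit_efd")      # assumed instance alias
        end = Time.now()
        start = end - TimeDelta(3600, format="sec")
        df = await client.select_time_series(
            "lsst.sal.MTMount.azimuth",       # assumed topic
            ["actualPosition"],               # assumed field
            start, end,
        )
        print(df.tail())

    asyncio.run(main())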



11:30 (DMLT only)Wrap Up/Actions/AOB/
  • YA: Campaign management reorganization (slides)

      • LG: Why are Colin and Yusra interim product owner and group lead?
        • YA: if someone signs up for the campaign management team, I want to leave these roles open to them.
        • LG: might be good to write current, not interim
        • YA: hoping that someone will step up and want to do the group lead role. Agree current is a better term
      • KT: pilot/co-pilot rotation sounds like a great idea, but people are not necessarily replaceable in what they were working on before. Is it expected that people will pause other work, and it will be restarted when they return from campaign management?
      • YA: I don't want people to feel trapped in this role, which can easily happen.
      • RD: Gotta run. We discussed this at the retreat so I’m ok with it. Remember there can be members from Europe. We also worried that 100% would indeed burn people out.
      • EB: There may be distinctions between how this looks for DRP/AP/commissioning.
        • WOM: Frossie has some org charts, we will write it up with more detail
      • GPDF: very supportive. Looks a lot like the mechanisms we set up eventually with BABAR. 100% agree this tends to burn people out; there is a small subset of people who enjoy it.
        • agree with EB that there will be significant differences in rhythm between AP and DRP
      • LG: question for Colin:
        • do you see different people in V&V rotating through as well?
        • CS: This is not a position I intend to sit in just to hold it; I would be happy for someone else to step up as product owner, but it is too important to leave empty at first
        • LG: I think it would be very helpful to have V&V team members rotate through, to spread experience.
          • WOM: really like the idea of more people having experience organizing/leading. Builds resilience
      • FE (slides)
        • This plan that YA proposed came out of DPP retreat last month.
          • Campaign management now part of Data Production under YA, next to Algorithms&Pipelines
          • Qserv now in Data Services under FE
  • FE: format of this meeting
    • Propose that we upload status slides in advance, use the timeslot for questions and discussion (standup style)
    • FM: finds it useful to write/present a plan in response to the discussions from the DMLT; would find it less useful to prepare in advance
    • JB: slides written to be read are different than ones written for presentation, since that allows greater clarification
    • WOM: We used to reserve some time at the end for discussion
    • WOM: suggestion is to start last session later, so that people have time to finish their presentations. Provide drafts in advance. Provide time for reading the presentations, and schedule time (~1hr) for discussion afterwards.
12:30 Close



Proposed Topics

Topic | Requested by | Time required (estimate) | Notes

Prompt Processing migration to the USDF | | 45 minutes? | Prefer scheduling after lunch Tuesday

Meeting free weeks 2023 | | 15m | Agree dates for meeting free weeks 2023:
  • Spring: JTM is March 14-16, so March 20; or April 3 or April 10 (around Easter)
  • Summer: June 12 - 16
  • Autumn: Sept 25 - 29
  • Winter: Dec 25 - Jan 5

User Generated Data Products | | 30m |

Commissioning Cluster, Yagan, and do we have what we need? | | 30m or less |

Processing frameworks at USDF | | 30m | Re: developer-driven processing. Closed session, please. CTS: I think this is the same as what Tim is proposing below.

OPS planning and milestones | @womullan | 1h | See the epics we have and what milestones make sense - block milestones with epics.

BPS plugin support | | | Decide whether we have an opinion on summit batch and developer batch (parsl vs condor vs panda). We are happy to support two BPS plugins.


