2022 February 15-17
This meeting will be virtual.
Join Zoom Meeting
- Aime Brown Wiest create a Zoom meeting please for Feb 15-17 - put info here
Meeting ID: 933 6817 3995
- Frossie Economou
- Unknown User (mbutler)
- Kian-Tat Lim
- Steve Pietrowicz
- Tim Jenness
- Jim Bosch
- Fritz Mueller
- Yusra AlSayyad
- Colin Slater
- Eric Bellm
- Robert Lupton
- Gregory Dubois-Felsmann
- Unknown User (npease)
- Wil O'Mullane
- Zeljko Ivezic
- Ian Sullivan
Day 1, Tuesday Feb 15 2022
|Time (Project)||Topic||Coordinator||Pre-meeting notes||Running notes|
Moderator: Yusra AlSayyad
Notetaker: Ian Sullivan
|09:00||Welcome - Project news and updates|
Missing Functionality as tickets
|10:00||QA||Quality of QA additions to lsst_distrib e.g. inability to run at IN2P3|
|10:00||Build system||Do we want any changes as we reimplement the build system at SLAC? How can we improve reliability and maintainability? What outputs are needed that we don't currently have (e.g. images from CI runs)? Can we simplify and combine any existing processes (e.g. separate lsst_distrib, ci_hsc, ci_imsim, nightly-release runs)?|
Need to stand up a brand-new instance of Jenkins at SLAC
Moderator: Wil O'Mullane
Notetaker: Kian-Tat Lim
|11:00||big files and little files and little pieces of big files||slides to kick off discussion|
|11:20||Little tasks vs big tasks||Brought up in the morning session and relevant to previous topic. News that e.g. preemptable queues cost 4X non-preemptable queues affects the tradeoffs in pipeline design.|
|11:45||Image services (esp. for DP0.2)||Gregory Dubois-Felsmann / Frossie Economou|
Define the remaining missing pieces needed to deliver acceptable image services for DP0.2 - and beyond - and ensure that they are all assigned to specific groups/T-CAMs
Slides for discussion
(See also slides from June 2021 DMLT)
Moderator: Leanne Guy
Notetaker: Frossie Economou
|Please note polls for June and Oct meetings||June: https://doodle.com/poll/ggwsy4hvm7y62ywc?utm_source=poll&utm_medium=link|
|Status of Prompt Processing||DMTN-219|
Moved to 10:00
Day 2, Wednesday Feb 16 2022
Moderator: Leanne Guy
Slides cover details of orgcharts and planning for USDF. Ref RTN-021 and DMTN-189. Assumptions slide quite good.
SDF currently shows 7-10K cores available (of 15K) over the last week.
RHL: Is the HSC data RC2? Yes, Brandon will bring it; Jim identified it.
EB: What's the distinction between Batch and K8s? RD: It is a palette of tools; we are trying not to tie it down. LLS will have a petaflop and much of the time will only use 30% of it. We can interface to Condor, etc. We can have a priority set of machines for Rubin, then spill out to others.
RHL: Support of LATISS? It is already taking data every month. Is Prompt being set up for it? RD: Had not heard of it.
KT: Anticipating Prompt would be at USDF on K8s. Whether it's operational and routine may take until Oct, but there should be something. For Eric: devs will typically log in to a similar dev node at USDF. RD: Open to what's best for dev.
CS: Login node is just authentication; no one cares. We care about the compute node. RD: SDF login nodes are beefier and could be used as development nodes. SP: NCSA login nodes are only a jump-off to dev nodes, but the HPC nodes are submission nodes to batch. RD: Others have had dedicated developer nodes; we can optimise together.
pdf - Richard
I've heard slightly conflicting reports about whether PanDA is definitely what we'll use for batch processing vs. something we're still evaluating (and if the latter, who is deciding, and when). I have some guesses about what might make us prefer it over HTCondor (or Parsl?), but they really are just guesses. I'd like to understand what the roadmap is here, both for long term use in operations and what our developers will be using after the move from NCSA to SLAC.
Path to processing - this is what Fritz is doing now.
CS: Org chart: who is responsible for nightly processing on there? RD: Execution team?
And when it goes wrong? If alerts fall over, they should know and call the correct person.
Not intending to run shifts at SLAC - but we are hoping to use European effort for out of hours - they would need training.
RHL: Feb looked very interesting (Fritz's end of Feb is almost April), so when is the first RC2? FM: Cobbling some bits together with JimC, using DC2 data already at SLAC, so end of Feb indeed, but it would not be the regular processing. DP0.2 people are shaking out a lot of PanDA issues and we want to take advantage of that. HSFC will also be back in April, so perhaps about then. Prototype in Feb, make robust in March; could have something in April (no promise).
RHL: Campaign management? RD: Is staffing up now, so come back in May; we are nowhere on this.
KTL: Rucio: we do want to start exec across 3 DFs at some point, so we need Rucio, but not in the initial runs (which would be like NCSA). What did Colin originally expect for execution? DRP and AP are the two big components, often concurrent. RD: Execution has 8-10 people in ops, but not now.
FE: Several in-kinds offered out of hours, Japanese perhaps. RD: Have been trying to ping Phil to see what kind of in-kinds we can get.
YA: Colin, are you worried that keeping all the processing running belongs elsewhere, or only that it's too small? CS: There are boxes for responsibilities; would like to see alerts and DRP separate. YA: But it's one group now.
YA: Thought there were shifts for exec. RD: We are hoping to cover it with Europeans etc. WOM: Notes he set up some of this and there are identified Alert and DRP people in Exec. Will talk with HFC when she returns on how this is organised exactly.
JB: Adopted PanDA since we wanted to orchestrate across many sites and it's the only game, but Parsl and Pegasus are more DAG oriented. KTL: Looking at ways to build graphs, submit to sites, and push up to campaign management, but for now still need PanDA features.
LG: Any plans to test Parsl, like a DP0.2 test? RD: Not aware of any. DESC are doing their DC2 reprocessing with Parsl. CC-IN2P3 are also doing DC2 with Parsl. Parsl has not done large-scale processing; multisite is weak and understanding what's going on is weak. TJ: Keeping BPS general. Condor is tested, PanDA is tested. SP: NCSA people meet with HTCondor monthly; they are set to help out. Wei: HTCondor and PanDA are not the same level of thing. Condor is more at batch level and can use something for workflows; PanDA does this. You could use HTCondor at SLAC.
JB: BPS HTCondor plugin is something we should have, and PanDA BPS was at the wrong level? TJ: No, PanDA needs to be told what the DAG is etc.; it could drive Condor. Per-site config of PanDA for
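TJ's point that a workflow system has to be told what the DAG is can be illustrated with a minimal sketch. The task names below are hypothetical, and the standard-library `graphlib` module only stands in for the idea; it does not reflect how BPS, PanDA, or HTCondor work internally. It shows how a dependency graph yields batches of jobs that could be released together.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "isr_visit1": set(),
    "isr_visit2": set(),
    "calibrate_visit1": {"isr_visit1"},
    "calibrate_visit2": {"isr_visit2"},
    "coadd": {"calibrate_visit1", "calibrate_visit2"},
    "detect": {"coadd"},
}


def submission_batches(graph):
    """Group tasks into batches whose dependencies are already satisfied."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that can run now
        batches.append(ready)
        sorter.done(*ready)  # mark them finished to unlock successors
    return batches


for batch in submission_batches(workflow):
    print(batch)
```

Each printed batch contains tasks with no unmet dependencies, so a batch system could run them concurrently.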
|10:00||No-meeting holiday period||Eric Bellm||review experience with December-January no-meeting period|
MB: Loved it. Started early and went long; people were ready to be back on the 8th.
KT: Need to decide the number of weeks
FE: Delighted. Agree with MB; 2 weeks more often may be better. 4 weeks is also too long for missing meetings.
YA: AAS meeting free week was exceptional - mentioned Ops initiative (above)
LG: Indeed AAS was a special case; 2 weeks would be great. CERN gives extra days on top of vacation. Would improve productivity.
FM: Team liked it. The SLAC shutdown is good; it was great to have it across broader DM. 3rd week was great, 4th week was a bit long; 2 weeks is a good minimum.
IS: Some negative feedback, mainly on messaging. Developers who do not have that many meetings were worried it was an enforced quiet period, and were a little unhappy about cancelled meetings, as a meeting is a check-in and a chance to see people. Personally liked it but was also sick. Within the local group we should have kept some meetings, like metrics; missed some things.
JB: Important to cancel meetings not just push them off. Focus on long term projects and skip maintenance for a while
RHL: In favor of this. Piling up at end of year (NOIRLab people take unused vacations); 2 weeks at Christmas and 2 weeks another time would be better.
FE: COVID made the use-it-or-lose-it worse. Disappointed we did not get the messaging correct: it was not Focus Friday and not a meeting ban; people still met. For SQuaRE standups, people opted in. People got more of Frossie's time on specific projects/devs.
CS: Was confused. FE: Simple heuristic: if I look at Feb 2023 there are a bunch of meetings; those are the ones we want to kill off.
KTL: Not being able to go away because you need to report etc.; doesn't this mean we have a problem with delegation?
FE: Decision process may be the problem; it forces use of RFC etc. Pushing more to geekbot. Better than other projects but still poor.
CS: decision tree process for meetings or not but what about others.
WOM: To summarise: 2 weeks is the answer, perhaps 2 times a year, plus another two 1-week periods. Communication on that can be improved; Colin/Ian should challenge me more on the wording next time.
Moderator: Frossie Economou
Notetaker: Gregory Dubois-Felsmann
As we did last time: look over overdue and near-term milestones.
Also the milestones LCR; I think I have it tidied:
DMTN-158 updated as of this morning. Milestone retirement is still on its old trajectory, though the scheduled "cliff" has fallen away below us. Need to do better.
DM-AP-10 (SFM2): Ian Sullivan waiting for analysis needed in order to determine whether ready to claim. Affected by departure of Chris M. Robert Lupton Not a formality? Doesn't this link to whether we need full-focal-plane astrometry? => Leave as-is for now.
DM-AP-12 (DIFF3): Ian Sullivan actively working on this one, still expect to complete before end of year, more likely 3-6 months => Revisit at next DMLT.
DM-AP-14 (ALERTDIST2): Eric Bellm Pending decision on whether we can claim external brokers (esp. Antares) as fulfilling this. Need to understand how to satisfy SRD requirement. Do we "claim it" or "descope it" in order to retire it? Zeljko Ivezic Need to flow resolution through the PST.
DM-AP-15 (INTEGRATION1): Eric Bellm Spencer has put together a full system in the IDF that brokers have actually connected to. Not at USDF and not connected to an actual AP pipeline. DM-AP-16 requires USDF. Frossie Economou advocates claiming this based on the IDF deployment. LCR the wording to say "at a Data Facility" to cover this.
DM-AP-17 (MOPS components): Ian Sullivan will be moved as part of the currently active milestone LCR.
DM-DAX-13 (GEN2 retired): Tim Jenness Should be a summer-2022 event. Is the definition "it has been removed" or "it can be removed"? Lauren signed off on science output equivalence between Gen2 and Gen3 - with even more outputs now in Gen3 than in Gen2. Decision: can be claimed in the February monthly report. Does not need to be "it has been removed from a release".
DM-DAX-14 (Provenance system review): Frossie Economou This is done. DM-DAX-9 will cover only Butler provenance. Where is the rest of it? Will claim DAX-14 in the January monthly report.
DM-AP-16 (INTEGRATION2): Ian Sullivan Will be moved forward by the current milestone LCR.
DM-DAX-9 (Butler provenance implementation): Add to current milestone LCR and tie to DP2?
DM-DRP-24 (DRP-MS-IMCHAR-3, PSF): Yusra AlSayyad To be moved by LCR
DM-DRP-29: Yusra AlSayyad will be done in April
DM-DRP-31: Yusra AlSayyad in LCR needs training data from LSSTCam
DM-DRP-36: Yusra AlSayyad make successor of AP
DM-DRP-37: Yusra AlSayyad not done, no action
DM-NCSA-21 (processing for ComCam): Unknown User (mbutler) see notes in spreadsheet, does not seem to be ready to claim
DM-NCSA-23: Unknown User (mbutler) DP0.2, will be claimable by June
DM-NCSA-14: Unknown User (mbutler) Rucio, to be done in next month or two
DM-SQRE-6 (SODA): Gregory Dubois-Felsmann Will be claimable based on functionality delivered for DP0.2.
LDM-GEN3: Leanne Guy Running tests now, planning to wrap up tests by end of month. DMTR-271.
LDM-503-12: didn't capture notes, covered by new LCR
LDM-503-15a: didn't capture notes
DM-NCSA-26: LCR and transfer to be a USDF/SLAC responsibility. Unknown User (mbutler) will check to see if this is in a pending LCR already.
DM-PORTAL: Frossie Economou Ready to claim, has been done
DM-SQRE-4: Frossie Economou Claimable now.
DM-SQRE-5: Frossie Economou Connect to end of the mini-surveys (effectively DP2)? This is a capability beyond LDM-503-RSPb.
LDM-503-12a: Need to associate with LSSTCam-on-sky. Push out to be around CAMM8090 (Camera pre-ship review at SLAC)?
LDM-503-13 (ops reh for DRP #1): Wil O'Mullane Intended to claim based on the functionality needed to deliver DP0.2. Yusra AlSayyad Consistent with review story that we are making up for delays in actual hardware by doing DM testing against the DPs.
LDM-503-14/15/16/17 are all in the current milestone LCR, tied to COMC-0100. Frossie Economou Worried about the scale of the test plans required for these milestones - do we have time to even write these tests? Tim Jenness This is effectively the test plan for the DMSR (LSE-61) itself (summed over all these milestones). Kian-Tat Lim LDM-503-14 should be just the Priority 1 parts of the DMSR.
Leanne Guy Maybe we need a milestone to represent the subset of DM functionality needed for ComCam? Robert Lupton Base it on actually processing AuxTel data. Wil O'Mullane That would just be the priority 1a milestones. LDM-503-14 would stay as is and would cover both priority 1a and 1b milestones.
Further discussion leads to reusing LDM-503-12a as the milestone for completing all the priority 1a requirements, and connect it to ComCam-on-sky in the schedule (currently May 2023).
Decision: -12a milestone will have been claimed at ComCam scale, then we will use the verification-monitoring process to retest at LSSTCam scale.
Frossie Economou notices that there are at least some 1a requirements that are not really properly needed for commissioning (like a full-user-community-scaled archive). Need a review / LCR for priorities.
|12:00||Status Part I|
>>> presentors = ['Leanne', 'Cristian', 'Frossie', 'KT', 'Fritz', 'Michelle', 'Ian', 'Yusra']
['KT', 'Yusra', 'Michelle', 'Ian', 'Cristian', 'Frossie', 'Fritz', 'Leanne']
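The second line reads like the result of shuffling the presenter list to pick a speaking order. The notes do not record how it was produced, so the following is only a sketch assuming Python's standard `random` module. The helper name `speaking_order` is hypothetical, and the variable keeps the original spelling `presentors`.

```python
import random

presentors = ['Leanne', 'Cristian', 'Frossie', 'KT', 'Fritz', 'Michelle', 'Ian', 'Yusra']


def speaking_order(names, seed=None):
    """Return a shuffled copy of names; pass a seed for a repeatable order."""
    rng = random.Random(seed)
    order = list(names)  # copy so the original list is left untouched
    rng.shuffle(order)
    return order


print(speaking_order(presentors))
```

Without a seed the order differs on each run, which is presumably the point when drawing a presentation order.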
(Not taking notes on what's on the presented slides, only questions/discussion.)
A little additional discussion of the decision to defer "client-server Butler" work.
KTL's slides triggered a vigorous side-discussion of "who owns campaign management?". Not a construction deliverable. Clearly a USDF responsibility as an Ops deliverable.
Evolved into a discussion of "what is campaign management?" - there seem to be two things being described under the same name, one of which doesn't yet have a clear written description.
Brief discussion of how to cover the coarser levels of the HiPS grid.
Notetaker: Kian-Tat Lim
|Status part II|
MichelleB is worried about moving data to SLAC and other transition issues
Some components from NTS are shared with other equipment; won't be shipped until Sep
Must get USDF running well before mid-August (aiming for June)
WOM: All data should be moved; could put some on tape and require asking for it to come back
WOM: Will Prompt Processing test measure the end-to-end alert delivery timing? Yes; taking the place of other planned measurements; profiling/optimization will happen in the future
Cristiàn is unavailable
No longer using "SUI"/"SUIT" terminology; use "Portal" instead
GPDF and FE to work on proposal for resources in the RSP
Burwood is contributing technical effort to cover some of shortfall due to Simon's absence, but not everything can be covered in a timely manner
Ways of non-interactive notebook execution: mobu for testing, Argo Workflow for "recommended", eventually "notebook batch". Developing new service to ask nublado to execute a notebook (with parameterization). Useful for DP delegates who only want to look at the results. Times Square presents results as a web page (caches to avoid re-execution). Also usable for end-of-night reports.
Qserv user-generated products has dependencies on things like understanding what if anything will happen with VOSpace
LG: SST trying to understand definition of done for DAX in Construction; will support DAX verification
GPDF: Is a waiver needed for user-generated products? Yes, build with Ops money
Special Programs support with non-standard pipelines generally moves into Ops, but need to make sure that RSP can handle SP products
Expect MW Gen3 verification campaign to be done by end of Feb
Day 3, Thursday Feb 17 2022 (Doesn't look needed as of Feb 7)
|Time (Project)||Topic||Coordinator||Pre-meeting notes||Running notes|
|Topic||Requested by||Time required (estimate)||Notes|
|Status of Prompt Processing||Eric Bellm||30||requesting status report/alignment from Kian-Tat Lim|
|USDF deployment||Eric Bellm||30|
if there is news to share from Richard Dubois on hardware deployment at the USDF; otherwise defer to a future DMLT
... maybe this is just a readthrough of RTN-021
|No-meeting holiday period||Eric Bellm||30||review experience with December-January no-meeting period|
|milestone parade||30-50||as we did last time - look over overdue and near-term milestones|
|Missing functionality||30-50||Chuck's list: Missing Capabilities needed by Commissioning|
|QA||30||Quality of QA additions to lsst_distrib e.g. inability to run at IN2P3|
|big files and little files and little pieces of big files||Jim Bosch||30|
Pipelines is starting to optimize its own I/O by merging small files into bigger files at various points. RSP services are starting to read small subsets of our biggest files, and generally assuming that this is faster than reading the entire big file.
Are we fooling ourselves testing these on GPFS instead of object stores? Do we actually need to be writing more small files, or fewer bigger files? Does Rucio care?
|Image services (esp. for DP0.2)||30-50||Define the remaining missing pieces needed to deliver acceptable image services for DP0.2 - and beyond - and ensure that they are all assigned to specific groups/T-CAMs.|
|Build system||20||Do we want any changes as we reimplement the build system at SLAC? How can we improve reliability and maintainability? What outputs are needed that we don't currently have (e.g. images from CI runs)? Can we simplify and combine any existing processes (e.g. separate lsst_distrib, ci_hsc, ci_imsim, nightly-release runs)?|
|PanDA evaluation/selection||Jim Bosch||20||I've heard slightly conflicting reports about whether PanDA is definitely what we'll use for batch processing vs. something we're still evaluating (and if the latter, who is deciding, and when). I have some guesses about what might make us prefer it over HTCondor (or Parsl?), but they really are just guesses. I'd like to understand what the roadmap is here, both for long term use in operations and what our developers will be using after the move from NCSA to SLAC.|
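One way to frame the question raised under "big files and little files" is a back-of-envelope cost model in which every read pays a fixed per-request latency plus a transfer time. The sketch below is illustrative only: the function name and the latency and bandwidth figures are assumptions chosen for the example, not measurements of GPFS or of any object store.

```python
def read_time(n_requests, total_bytes, latency_s, bandwidth_bps):
    """Seconds to fetch total_bytes using n_requests sequential requests."""
    return n_requests * latency_s + total_bytes / bandwidth_bps


# Assumed, illustrative numbers (not measurements).
MB = 1_000_000
LATENCY_S = 0.05        # fixed overhead per request
BANDWIDTH = 100 * MB    # bytes per second once data is flowing

# Case 1: 1000 small files of 1 MB each, every file read whole.
small_files = read_time(1000, 1000 * MB, LATENCY_S, BANDWIDTH)
# Case 2: the same data merged into one 1000 MB file, read whole.
big_file = read_time(1, 1000 * MB, LATENCY_S, BANDWIDTH)
# Case 3: a 1 MB subset of the merged file fetched with one ranged read.
cutout = read_time(1, 1 * MB, LATENCY_S, BANDWIDTH)

print(f"1000 small files: {small_files:.2f} s")
print(f"one big file:     {big_file:.2f} s")
print(f"1 MB range read:  {cutout:.2f} s")
```

With these assumed numbers the per-request latency dominates the many-small-files case, which is why the answer could differ between a low-latency POSIX filesystem and an object store.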
Action Item Summary
|Description||Due date||Assignee||Task appears on|
| ||15 Mar 2022||Frossie Economou||DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17|
| ||18 Mar 2022||Kian-Tat Lim||DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17|
| ||15 Nov 2022||Gregory Dubois-Felsmann||DMLT meeting-2022-10-24|
| ||24 Apr 2023||Gregory Dubois-Felsmann||DMLT meeting-2023-04-17|
| ||04 Sep 2023||Leanne Guy||DMLT meeting-2023-08-21|
| ||17 Nov 2023||Frossie Economou||DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24|
| ||24 Nov 2023||Wil O'Mullane||DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24|
| ||30 Nov 2023||Yusra AlSayyad||DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24|
| ||30 Nov 2023||Frossie Economou||DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24|
| ||11 Dec 2023||Gregory Dubois-Felsmann||DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24|
| ||11 Dec 2023||Gregory Dubois-Felsmann||DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24|
| || ||Wil O'Mullane||DMLT meeting-2023-09-11|