Zoom Link
Time
8 am PT
Attendees
Brian Yanny Wei Yang Tim Jenness James Chiang Mikolaj Kowalik Jen Adelman-Mccarthy Michelle Gower Edward Karavakis Peter Love Wen Guan Colin Slater
Regrets
- Fabio Hernandez (hosting and attending biannual LSST France meeting)
- Richard Dubois
Links
- CM/Panda Interaction: https://confluence.lsstcorp.org/x/f9lGDQ
- Panda Status: link
- Panda team's Rubin Work/Priority List
Agenda:
- CM news
- Panda News:
- Rubin 'HammerCloud' revisit, cont'd.
- ARC CE monitoring at Lancaster: https://lsst.lancs.ac.uk/fabric/. It also runs a pipelines_check job. Can it be run via Panda?
- older:
- Some comments in #rubinobs_panda channel https://lsstc.slack.com/archives/C01J0QS3X70/p1699597562986749 (and the one below)
- The ci_hsc_gen3 test is very useful, but is it too heavy as an HC?
Notes:
- CM news:
- Able to launch jobs via Panda to USDF/UKDF/FrDF. Cannot see the remote Butler; a special command is needed for that.
- Increased QG generation time limit 1h → 24h.
- Did a few stress tests, got sidetracked a bit, will continue stress testing.
- Saw 5-6K concurrent jobs at USDF Panda (???), will try 10K (Colin noted that 3K jobs per DF is the minimum request).
- Discrepancy between CPU time and Wall time at USDF:
- deep-coadd / forced photometry ccd jobs, like N2 operation.
- 8 out of 249 jobs have wall time ~ 8x CPU time. The remaining jobs have wall time ~ 1x CPU time.
- Tim J.: Slurm CPU pinning issue?
- Peter L.: it is possible to provide ssh access to UKDF for the CM team. Likely true for FrDF as well.
- Post-meeting update: documentation on how to get an account at FrDF is here
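A minimal sketch of how the wall time / CPU time discrepancy noted above could be checked, assuming per-job CPU and wall times are available as plain records. The `JobTiming` class, the field names, and the threshold of 4 are illustrative assumptions, not part of any Panda or Slurm API, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class JobTiming:
    """Hypothetical per-job record; field names are assumptions."""
    job_id: str
    cpu_seconds: float
    wall_seconds: float


def wall_to_cpu_ratio(job: JobTiming) -> float:
    """Return wall time divided by CPU time (infinite if no CPU time was recorded)."""
    if job.cpu_seconds <= 0:
        return float("inf")
    return job.wall_seconds / job.cpu_seconds


def find_slow_jobs(jobs, threshold=4.0):
    """Return jobs whose wall/CPU ratio is at or above the (assumed) threshold."""
    return [job for job in jobs if wall_to_cpu_ratio(job) >= threshold]


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = [
        JobTiming("job-1", cpu_seconds=1000.0, wall_seconds=1050.0),
        JobTiming("job-2", cpu_seconds=1000.0, wall_seconds=8000.0),
    ]
    for job in find_slow_jobs(sample):
        print(job.job_id, round(wall_to_cpu_ratio(job), 1))
```

Grouping the flagged jobs by batch node would be a natural next step if CPU pinning is suspected.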
- Panda News:
- Propose to use m-core jobs (m=8 initially).
- A pilot wrapper in a batch job will launch "m" pilots to fetch jobs from the same Panda queues. These pilots/jobs will still have separate logs (unchanged), except for Harvester logs and CE records (which match the HTCondor submission).
- This will reduce load on the batch system and CE but not on the Panda system. For the latter, we will look into Event Service (later step).
- Green light from the CM team and the Middleware team.
- For jobs with max memory request, what is the max memory per batch node? USDF: 500-512G, UKDF and FrDF? Need to know this to prevent submitting jobs that can't run at a DF.
- Post-meeting update: compute nodes usable by Rubin at FrDF have 2 hardware configurations: A) 64 CPU cores, 192 GB of RAM; B) 112 CPU cores, 1 TB of RAM. Most of the nodes have configuration A. Both kinds of nodes are reachable by jobs submitted via PanDA.
- Wen will work on uniform Panda queue and batch job names at all DFs, possibly with short names.
- Eddie: optimizing DB partitions.
- Peter L. is trying to upgrade monitoring at Lancaster to run pipelines_check jobs via Panda.
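A rough sketch of two points from the Panda news above: how m-core batch jobs reduce the number of batch submissions, and how a per-DF memory limit could be used to avoid submitting jobs that cannot run. The function names and the limits table are assumptions for illustration. The USDF value takes the lower end of the 500-512G range quoted above, the FrDF value takes the larger node configuration (1 TB, written here as 1000 GB), and UKDF is left unknown because no figure was given.

```python
import math

# Assumed per-node memory limits in GB, taken from the notes above.
# None marks a DF whose limit was not reported in these minutes.
MAX_NODE_MEMORY_GB = {
    "USDF": 500,
    "UKDF": None,
    "FrDF": 1000,
}


def batch_jobs_needed(n_payload_jobs: int, m: int = 8) -> int:
    """Number of m-core batch jobs needed if each one runs m pilots concurrently."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return math.ceil(n_payload_jobs / m)


def facilities_for_request(request_gb: float, limits=MAX_NODE_MEMORY_GB):
    """Split DFs into those that can fit the request and those with unknown limits."""
    fits = [df for df, cap in limits.items() if cap is not None and request_gb <= cap]
    unknown = [df for df, cap in limits.items() if cap is None]
    return fits, unknown


if __name__ == "__main__":
    print(batch_jobs_needed(10_000, m=8))
    print(facilities_for_request(600))
```

With m=8, a target of 10K concurrent payload jobs would correspond to 1250 batch jobs, which is the load reduction on the batch system and CE that the proposal aims for.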