Panda Meeting 2023-12-13

Zoom Link

Time

8 am PT

Attendees

Brian Yanny, Wei Yang, Tim Jenness, James Chiang, Mikolaj Kowalik, Jen Adelman-Mccarthy, Michelle Gower, Edward Karavakis, Peter Love, Wen Guan, Colin Slater

Regrets

Fabio Hernandez (hosting and attending the biannual LSST France meeting), Richard Dubois

Links

CM/Panda Interaction: https://confluence.lsstcorp.org/x/f9lGDQ
Panda Status: link
Panda team's Rubin Work/Priority List

Agenda:

CM news
Panda News:
Rubin 'HammerCloud' revisit, cont'd.
1. ARC CE monitoring at Lancaster: https://lsst.lancs.ac.uk/fabric/. It also runs a pipelines_check job. Can it be run via Panda?
2. older:
  1. Some comments in #rubinobs_panda channel https://lsstc.slack.com/archives/C01J0QS3X70/p1699597562986749 (and the one below)
  2. The ci_hsc_gen3 test is very useful, but is it too heavy as a HammerCloud (HC) test?

Notes:

CM news:
1. Able to launch jobs via Panda to USDF/UKDF/FrDF. Cannot see the remote Butler; a special command is needed for that.
2. Increased QG generation time limit 1h → 24h.
3. Did a few stress tests, got sidetracked a bit, will continue stress testing.
  1. Saw 5-6K concurrent jobs at USDF Panda (???), will try 10K (Colin noted that 3K jobs per DF is the minimum requested).
4. Discrepancy between CPU time and Wall time at USDF:
  1. Seen in deep-coadd / forced photometry CCD jobs, which behave like an N² operation.
  2. 8 out of 249 jobs have wall time ~ 8x CPU time. The rest of the jobs have wall time ~ 1x CPU time.
  3. Tim J.: Slurm CPU pinning issue?
5. Peter L.: it is possible to provide ssh access to UKDF for the CM team. Likely true for FrDF as well.
  1. Post-meeting update: documentation on how to get an account at FrDF is here
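The CPU time versus wall time comparison in item 4 can be sketched as a simple ratio check. This is only an illustration, assuming job records are available as (job id, CPU seconds, wall seconds) tuples. The function name `flag_slow_jobs`, the record layout, the threshold of 4, and the example data are all hypothetical and are not part of Panda or the CM tooling.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (job_id, cpu_seconds, wall_seconds).
JobRecord = Tuple[str, float, float]


def flag_slow_jobs(
    jobs: Iterable[JobRecord], threshold: float = 4.0
) -> List[Tuple[str, float]]:
    """Return (job_id, wall/CPU ratio) for jobs whose ratio exceeds threshold."""
    flagged = []
    for job_id, cpu_s, wall_s in jobs:
        if cpu_s <= 0:
            # No usable CPU time recorded; skip rather than divide by zero.
            continue
        ratio = wall_s / cpu_s
        if ratio > threshold:
            flagged.append((job_id, ratio))
    return flagged


# Made-up example data, not measurements from USDF.
example = [
    ("job-a", 1000.0, 1050.0),  # wall time ~ 1x CPU time
    ("job-b", 1000.0, 8000.0),  # wall time ~ 8x CPU time
    ("job-c", 0.0, 10.0),       # no CPU time recorded
]
print(flag_slow_jobs(example))
```

Grouping the flagged jobs by batch node would be one way to test the CPU pinning hypothesis, since pinning problems would be expected to cluster on specific hosts.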
Panda News:
1. Propose to use m-core jobs (m=8 initially).
  1. A pilot wrapper in a batch job will launch "m" pilots to fetch jobs from the same Panda Queues. These pilots/jobs will still have separate logs (unchanged), except for Harvester logs and CE records (which match the HTCondor submission).
  2. This will reduce the load on the batch system and CE, but not the load on the Panda system. For the latter, we will look into Event Service (later step).
  3. Green light from the CM team and the Middleware team.
2. For jobs with the maximum memory request, what is the maximum memory per batch node? USDF: 500-512G; UKDF and FrDF? We need to know this to avoid submitting jobs that cannot run at a DF.
  1. Post-meeting update: compute nodes usable by Rubin at FrDF have 2 hardware configurations: A) 64 CPU cores, 192 GB of RAM; B) 112 CPU cores, 1 TB of RAM. Most of the nodes have configuration A. Both kinds of nodes are reachable by jobs submitted via PanDA.
3. Wen will work on uniform naming for Panda Queues and batch jobs at all DFs, possibly with short names.
4. Eddie: optimizing DB partitions.
Peter L. is trying to upgrade the monitoring at Lancaster to run pipelines_check jobs via Panda.
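The node-memory question in item 2 under Panda News could be turned into a pre-submission check along the following lines. This is a minimal sketch under stated assumptions: the table `MAX_NODE_MEMORY_GB` and the function `facilities_that_fit` are hypothetical names, the USDF value uses the lower bound of the quoted 500-512G range, the FrDF value takes the 1 TB of configuration B as 1024 GB, and UKDF is left unknown because no figure was given.

```python
from typing import Dict, List, Optional

# Largest usable node memory per data facility, in GB (assumed values).
# USDF: lower bound of the 500-512G range quoted in the notes.
# FrDF: configuration B (1 TB, taken here as 1024 GB); most FrDF nodes
#       are configuration A with only 192 GB.
# UKDF: not known at the time of the meeting, so left as None.
MAX_NODE_MEMORY_GB: Dict[str, Optional[float]] = {
    "USDF": 500.0,
    "FrDF": 1024.0,
    "UKDF": None,
}


def facilities_that_fit(request_gb: float) -> List[str]:
    """Return the facilities whose largest known node can hold the request."""
    return [
        df
        for df, limit in MAX_NODE_MEMORY_GB.items()
        if limit is not None and request_gb <= limit
    ]


# A 600 GB request exceeds the assumed USDF limit but fits FrDF configuration B.
print(facilities_that_fit(600.0))
```

Facilities with an unknown limit are excluded here, which is the conservative choice; a job would not be routed to UKDF until its node memory is confirmed.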
