Time

8 am PT

Attendees

Tim Jenness, Wei Yang, Brian Yanny, Richard Dubois, Wen Guan, Peter Love, Zhaoyu Yang, Jen Adelman-Mccarthy, Fabio Hernandez, James Chiang, Edward Karavakis, Mikolaj Kowalik

Regrets


Agenda:

  1. Update on running all 7 steps at USDF
    1. How to work around the TMPDIR space issue?
  2. Update on Panda installation
  3. Review memory settings at USDF, and what do we want at EUDFs?
    1. Wen Guan: Is this information in the Panda/CRIC JSON files?

Notes:

  1. Ran all 7 steps at USDF successfully (https://lsstc.slack.com/archives/C01J0QS3X70/p1683055589999199)
    1. bps clustering enabled
    2. The merge job at step 4 sometimes segfaults. It is not easy to reproduce or debug manually (except when running out of workdir space, but that is not the case in the crashed Panda jobs). Still chasing this issue.
    3. Memory requirement plot by Zhaoyu for the above tests https://files.slack.com/files-pri/T06D204F2-F055H2KDBJT/memory.png
      1. Compare this to DP0.2 run at FrDF https://me.lsst.eu/fabio/rubin-dp0.2-at-frdf/pipetasks/html/memory-consumption-per-pipetask.html
    4. Zhaoyu noticed that it is difficult to schedule many concurrent 8GB jobs at UKDF. Peter: UKDF is using the SGE batch system and will migrate to SLURM.
    5. Richard: USDF and FrDF have monitoring of batch system usage (by Rubin), etc. Does UKDF have it? Peter: UKDF will after migrating to SLURM.
      1. Note this is different from the monitoring info Panda will provide (which does not know about non-Panda jobs run by Rubin users).
  2. Panda Server at SLAC
    1. Tested with FrDF and UKDF (in addition to USDF) with Panda-dev. Works.
    2. Working with Yee to set up a production k8s cluster for Panda.
    3. IAM@SLAC port issue (8443 vs 443) resolved. However, copying registered users from IAM@CERN to SLAC failed due to a version difference. Users will have to re-register with IAM@SLAC. Support registration with a Google account.
    4. IDF users will be divided into different service levels. Will likely also need several corresponding service accounts at USDF.
    5. Wen Guan and Edward Karavakis will write an RTN. Also need to document how to manage services and users.
  3. Review Panda Queues at USDF 
    1. USDF PQs and memory
      1. Note on HW config: USDF 4GB/core, FrDF 3GB/core, UKDF 4GB/core. Jobs can request more (or less) memory per core at all DFs.
      2. Some jobs require O(100GB) memory. Max RAM/node: USDF 512GB, FrDF 256GB. We need to test and find out the max RAM we need in real production.
    2. Action Items/plan
      1. Replicate USDF PQs in CRIC for FrDF and UKDF and import them to Panda (Wen: done)
        1. Note that in the PQ name, we have SLAC/USDF, CCIN2P3/FrDF, LANCS/UKDF... (low priority)
      2. Run all 7 steps at all DFs simultaneously. We can use Panda@CERN (no need to wait for Panda@SLAC). The goal is to see whether anything breaks and to expose problems sooner.
      3. Work with CM team on their upcoming big run.
  4. AOD
    1. No meeting on May 10 (Fabio, Eddie, Torre, Wei will be at CHEP. Roger Jones (Lancaster) will be there too).
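
Supplementary sketch for the "USDF PQs and memory" note above: a rough way to see how a job's memory request translates into occupied cores at each DF, using only the GB/core and max RAM/node figures recorded in these minutes. The function names and the "core-equivalent" accounting (rounding the request up to whole cores) are assumptions made for illustration; the actual batch systems and Panda queues may account for memory differently. UKDF's max RAM/node was not recorded, so it is left unset.

```python
import math

# Figures quoted in the minutes. UKDF max RAM/node was not recorded (None).
DF_CONFIG = {
    "USDF": {"gb_per_core": 4, "max_node_gb": 512},
    "FrDF": {"gb_per_core": 3, "max_node_gb": 256},
    "UKDF": {"gb_per_core": 4, "max_node_gb": None},
}


def cores_equivalent(job_gb, gb_per_core):
    """Whole cores' worth of memory a job occupies (hypothetical accounting)."""
    return math.ceil(job_gb / gb_per_core)


def fits_on_node(job_gb, max_node_gb):
    """True/False if the node size is known, None if it was not recorded."""
    if max_node_gb is None:
        return None
    return job_gb <= max_node_gb


def summarize(job_gb):
    """Per-DF summary for a single job requesting job_gb of memory."""
    return {
        df: {
            "cores_equivalent": cores_equivalent(job_gb, cfg["gb_per_core"]),
            "fits_on_node": fits_on_node(job_gb, cfg["max_node_gb"]),
        }
        for df, cfg in DF_CONFIG.items()
    }


if __name__ == "__main__":
    # 8GB matches the job size discussed for UKDF; 100GB stands in for O(100GB).
    for job_gb in (8, 100):
        for df, info in summarize(job_gb).items():
            print(f"{job_gb}GB job at {df}: {info}")
```

Under this accounting an 8GB job occupies two cores' worth of memory at 4GB/core and three at 3GB/core, which is one way to look at the scheduling difficulty noted above.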