Zoom Link
Time
8 am PT
Attendees
Tim Jenness, Wei Yang, Brian Yanny, Richard Dubois, Wen Guan, Peter Love, Zhaoyu Yang, Jen Adelman-Mccarthy, Fabio Hernandez, James Chiang, Edward Karavakis, Mikolaj Kowalik
Regrets
Agenda:
- Update on running all 7 steps at USDF
- How to work around the TMPDIR space issue?
- Update on Panda installation
- Review memory settings at USDF; what do we want at the EUDFs?
- Wen Guan: Is this information in the Panda/CRIC JSON files?
Notes:
- Ran all 7 steps at USDF successfully (https://lsstc.slack.com/archives/C01J0QS3X70/p1683055589999199)
- bps clustering enabled
- The merge job at step 4 sometimes seg faults. This is not easy to reproduce or debug manually (except when running out of workdir space, but that is not the case in the crashed Panda jobs). Still chasing this issue.
- Memory requirement plot by Zhaoyu for the above tests https://files.slack.com/files-pri/T06D204F2-F055H2KDBJT/memory.png
- Compare this to DP0.2 run at FrDF https://me.lsst.eu/fabio/rubin-dp0.2-at-frdf/pipetasks/html/memory-consumption-per-pipetask.html
- Zhaoyu noticed that it is difficult to schedule many concurrent 8GB jobs at UKDF. Peter: UKDF is using the SGE batch system and will migrate to SLURM.
- Richard: USDF and FrDF have monitoring of batch system usage (by Rubin), etc. Does UKDF have it? Peter: UKDF will, after migrating to SLURM.
- Note this is different from the monitoring info Panda will provide (which does not know about non-Panda jobs run by Rubin users).
- Panda Server at SLAC
- Tested with FrDF and UKDF (in addition to USDF) w/ Panda-dev. Works
- Working with Yee to set up a production k8s cluster for Panda
- IAM (@SLAC) port issue (8443 vs 443) resolved. However, copying registered users from IAM@CERN to SLAC failed due to a version difference. Users will have to re-register w/ IAM@SLAC. Support registration with Google account
- IDF users will be divided into different service levels. Will likely also need several corresponding service accounts at USDF.
- Wen Guan and Edward Karavakis will write an RTN. Also need to document how to manage services and users.
- Review Panda Queues at USDF
- USDF PQs and memory
- Note on HW config: USDF 4GB/core, FrDF 3GB/core, UKDF 4GB/core. Jobs can request more (or less) memory per core at all DFs.
- Some jobs require O(100GB) memory. Max RAM/node: USDF 512GB, FrDF 256GB. We need to test and find out the maximum RAM we need in real production.
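
To make the memory figures above concrete, here is a minimal Python sketch of the arithmetic. It uses only the GB/core and max RAM/node values recorded in these notes. The function names and the idea of counting memory in whole-core units are illustrative assumptions, not how Panda or any DF batch system actually accounts for memory.

```python
import math

# GB of RAM per core at each Data Facility, as recorded in the notes.
GB_PER_CORE = {"USDF": 4, "FrDF": 3, "UKDF": 4}

# Maximum RAM per node in GB. The notes give values only for USDF and FrDF.
MAX_NODE_RAM_GB = {"USDF": 512, "FrDF": 256}


def core_equivalents(request_gb, site):
    """Cores' worth of memory a request occupies, rounded up.

    Assumption: memory is counted in whole-core units. The real batch
    system configuration at each site may differ.
    """
    return math.ceil(request_gb / GB_PER_CORE[site])


def fits_on_node(request_gb, site):
    """True or False when the node limit is known, None when it is not."""
    limit = MAX_NODE_RAM_GB.get(site)
    if limit is None:
        return None
    return request_gb <= limit
```

Under these assumptions an 8GB job takes the memory share of 2 cores at USDF or UKDF and 3 cores at FrDF, and an O(100GB) job fits on a single node at both USDF and FrDF.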
- Action Items/plan
- Replicate USDF PQs in CRIC for FrDF and UKDF and import them to Panda (Wen: done)
- Note that in the PQ name, we have SLAC/USDF, CCIN2P3/FrDF, LANCS/UKDF... (low priority)
- Run all 7 steps at all DFs simultaneously. We can use Panda@CERN (don't have to wait for Panda@SLAC). The goal is to see whether anything breaks, and to expose problems sooner.
- Work with CM team on their upcoming big run.
- AOB
- No meeting on May 10 (Fabio, Eddie, Torre, and Wei will be at CHEP; Roger Jones (Lancaster) will be there too).