Panda Meeting 2023-05-17

Zoom Link

Time

8 am PT

Attendees

James Chiang, Wei Yang, Tim Jenness, Richard Dubois, Jen Adelman-Mccarthy, Edward Karavakis, Wen Guan, Michelle Gower, Zhaoyu Yang

Regrets

Brian Yanny

Agenda:

Memory queues and DF-specific memory policy (whether upper limits apply to RSS or VSZ)
1. (has to pause to make room for the CM team to use the Panda server resources)
CM team news
USDF Panda installation
1. Where are we with the -prod installation? Estimated timeline?
2. rubin-panda-iam-dev DB issue
3. DB responsibilities: responsibility of DB architecture, deployment vs operation
AOB

Notes:

multi-DF testing:
1. USDF (Slurm) kills jobs when RSS (real memory in RAM) reaches the limit.
  1. From Renata: we have ConstrainRAMSpace=yes in our Slurm cgroup.conf, which means "constrain the job's RAM usage by setting the memory soft limit to the allocated memory and the hard limit to the allocated memory."
2. FrDF (Slurm) kills jobs when VM exceeds the limit. This is not expected and needs investigation; Wei is asking the SLAC Slurm admins (see above).
3. UKDF (SGE) kills jobs based on VM. This is expected; UKDF will move to Slurm in the future.
4. This inconsistency will make it difficult for the CM team to manage jobs.
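
To make the inconsistency concrete, here is a minimal Python sketch of how the same job can survive at one DF and be killed at another. The metric assigned to each DF follows the notes above; the function names, the strict "greater than" comparison, and the example numbers are illustrative assumptions, not part of any Panda or Slurm API.

```python
# Which memory metric each data facility enforces, as recorded in these notes.
DF_KILL_METRIC = {
    "USDF": "rss",  # Slurm with ConstrainRAMSpace=yes
    "FrDF": "vsz",  # Slurm; observed behaviour, still under investigation
    "UKDF": "vsz",  # SGE
}


def would_be_killed(df, rss_mb, vsz_mb, limit_mb):
    """Return True if the job's footprint exceeds the limit at facility `df`.

    Simplification (assumption): the job is killed as soon as the enforced
    metric is strictly greater than the limit.
    """
    used = rss_mb if DF_KILL_METRIC[df] == "rss" else vsz_mb
    return used > limit_mb


def outcomes(rss_mb, vsz_mb, limit_mb):
    """Evaluate the same job footprint against every facility."""
    return {
        df: would_be_killed(df, rss_mb, vsz_mb, limit_mb)
        for df in DF_KILL_METRIC
    }


if __name__ == "__main__":
    # Hypothetical job: 3 GB resident, 6 GB virtual, 4 GB limit.
    print(outcomes(rss_mb=3000, vsz_mb=6000, limit_mb=4000))
```

A job whose virtual size is well above its resident size passes an RSS-based limit but fails a VM-based one, so a single memory request cannot be tuned for all three DFs at once.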
CM team news (Brian's notes)
1. Submitting a large number of jobs via bps submit overwhelmed Panda-DOMA (CERN). Wen believes this is an issue with the messaging system between iDDS and JEDI (and retries amplify the issue). Tadashi Maeno is working on an improvement and will reduce the number of ActiveMQ subscriptions by JEDI (so messages will buffer in ActiveMQ).
2. How do we handle failures in clustering (along pipeline tasks) and grouping (along detectors, vertical to the pipeline)?
  1. Tim: for some tasks (isr?), failure is OK and can be ignored; we can still move to the next task.
  2. Wen: the Panda team plans to use the event service and treat each quantum as an "event". This will allow grouping while still keeping the processing of each quantum separate.
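
The grouping idea in Wen's comment can be sketched as follows. This is not the Panda event service API; `chunk`, `run_group`, and the quantum labels are hypothetical names used only to show quanta being grouped into one job while each quantum still succeeds or fails on its own.

```python
def chunk(quanta, size):
    """Split a list of quanta into groups of at most `size` items."""
    return [quanta[i:i + size] for i in range(0, len(quanta), size)]


def run_group(quanta, process):
    """Process each quantum in a group separately and record its status.

    A failure in one quantum is recorded and does not stop the others,
    which mirrors treating each quantum as an independent "event".
    """
    results = {}
    for quantum in quanta:
        try:
            process(quantum)
            results[quantum] = "done"
        except Exception as exc:
            results[quantum] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    def process(quantum):
        # Hypothetical payload: one detector's quantum fails.
        if quantum == "isr:det02":
            raise RuntimeError("bad input")

    for group in chunk(["isr:det01", "isr:det02", "isr:det03"], size=2):
        print(run_group(group, process))
```

With per-quantum status available, a downstream policy such as the one Tim describes (ignore failures for some tasks and move on) can be applied to individual quanta rather than to the whole group.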
Panda installation at USDF
1. Waiting for host certs and DB setup (certs are ready now).
2. IAM in the -dev instance lost its DB content during a pod restart. Suggested using a PVC (PersistentVolumeClaim) for DB storage. But how is the PVC backed up? Do we need our own backup (mysql dump, since the IAM info is relatively static)?
3. Postgres DB backup will use S3. Eddie: waiting for a prod S3 for Panda -prod.
4. Note that DB backup is not a replacement for HA. Richard: the HA mechanism in K8s has been tested.
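
If we do end up running our own dump of the IAM DB, it could look roughly like the sketch below. It only builds a timestamped `mysqldump` command line; the database name, host, user, and output directory are placeholders, and credentials are assumed to come from a MySQL option file (for example `~/.my.cnf`) rather than the command line.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_dump_command(db_name, host, user, out_dir, now=None):
    """Return (argv, output_path) for a timestamped mysqldump of one database.

    All argument values are supplied by the caller; nothing here reflects
    the actual IAM deployment.
    """
    now = now or datetime.now(timezone.utc)
    out_path = Path(out_dir) / f"{db_name}-{now:%Y%m%dT%H%M%SZ}.sql"
    argv = [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        f"--result-file={out_path}",
        db_name,
    ]
    return argv, out_path


if __name__ == "__main__":
    # Placeholder values for illustration only.
    argv, out_path = build_dump_command("iam_db", "iam-db.example", "iam", "/backups")
    print(" ".join(argv))
```

The returned list can be passed to `subprocess.run(argv, check=True)` from a scheduled job. Because the IAM info is relatively static, an infrequent dump may be enough, but as noted above this covers recovery only and is not a substitute for HA.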
