
Time

8 am PT

Attendees

Brian Yanny, Wei Yang, James Chiang, Michelle Gower, Richard Dubois, Tim Jenness, Edward Karavakis, Torre Wenaus, Wen Guan, Peter Love, Colin Slater

Regrets

Yusra AlSayyad, Fabio Hernandez

Agenda:

  1. CM news
  2. USDF Panda
    • DB performance tuning: reducing records per commit to <= 2000
    • other scaling tests
  3. Panda next steps: Event Service and M-core
    1. https://docs.google.com/document/d/1lWGVvUrH2dIBWa04eaPU3BFmhif0sKQd5GEgZ9rTnS4/edit


Notes:

  1. Panda Team News:
    1. Zhaoyu is unavailable in the near future. Wen will take over some of Zhaoyu's responsibilities.
    2. The Panda team is looking into a ticketing system that non-ATLAS people can access, likely GitHub.
    3. Looking into a time slot so that Rubin folks (e.g. Wei) can attend the Panda development meeting. Wen and Tadashi will work out the details.
  2. CM news:
    1. Interested in stress testing using USDF Panda. Has the green light from the Panda team to proceed.
  3. USDF Panda
    1. Panda Rubin work/priority list (above, in the reference section)
    2. Current issues:
      1. Reduced all DB commits to fewer than 2000 records each. Most DB commits take 1-2 seconds, but a few take 30-60 s for reasons unknown.
      2. The 30-60 s delays in iDDS DB commits prevent iDDS from releasing jobs to Panda in time, which starves Panda.
      3. The "timefloor" for Panda jobs was set to 6 h (a Panda pilot keeps asking for new jobs until it reaches the 6 h limit), but this won't help when iDDS can't release jobs to Panda fast enough.
      4. Would like to have a meeting with Dan Speck to go over the CNPG configuration (Wei will organize). Would also like some help from DBAs.
      5. Eddie would like to test CNPG with local disk as storage. Wei will ask S3DF.
    3. Next plan idea:
      1. M-core jobs and the Event Service (see "How to bulk" in the link above). Two ideas: group jobs in width (across the pipeline, thus m-core) and group jobs in depth (along the pipeline; the current idea is to utilize bps clustering).
      2. Both reduce the pressure on Slurm and the ARC CE.
      3. ES: treats a Rubin pipeline quantum (e.g. isr) as if it were an HEP event. This (helpfully) reduces the pressure on the Panda DB.
      4. Questions to be answered:
        1. One pipetask failure will not affect others (unless there is a dependency).
        2. Realtime logs in ES and M-core: each pilot can only have one realtime log stream, so multiple pipeline tasks will show up in the same stream (each log entry carries a tag identifying its origin). Is this acceptable to the CM team?
      5. Colin: was the bps clustering length (~5) previously limited by the 4000-character cap? We now set it to 8000. How does that limit bps clusters now? Note that Postgres allows 64 kB blobs. Another option is to have the middleware team eliminate the limit entirely (e.g. by passing the UUIDs via a file).
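
The commit-batching change discussed under "Current issues" (capping each DB commit at 2000 records so that one slow commit delays fewer records) can be sketched as below. This is only an illustration using Python's stdlib sqlite3, not the actual iDDS code; the table name, columns, and batch size are assumptions.

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 2000  # cap on records per commit, per the tuning discussed above

def batched_insert(conn, records, batch_size=BATCH_SIZE):
    """Insert records in batches, committing after at most batch_size rows.

    Short transactions mean a single slow commit (the 30-60 s stalls
    noted above) holds up at most batch_size records, not the whole run.
    """
    it = iter(records)
    total = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO jobs (job_id, status) VALUES (?, ?)", batch)
        conn.commit()  # one commit per batch instead of one giant commit
        total += len(batch)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id INTEGER, status TEXT)")
n = batched_insert(conn, ((i, "pending") for i in range(5000)))
```

With 5000 records this performs three commits (2000 + 2000 + 1000) rather than one.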
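
The file-based alternative Colin mentions (removing the character cap by passing the quantum UUIDs via a file instead of an inline field) could look roughly like this. A minimal sketch; the file layout and function names are hypothetical, not middleware API.

```python
import pathlib
import tempfile
import uuid

def write_uuid_file(uuids, path):
    """Write one UUID per line; a file has no 4000/8000-character cap,
    so cluster size is no longer bound by a database field length."""
    pathlib.Path(path).write_text("\n".join(str(u) for u in uuids) + "\n")

def read_uuid_file(path):
    """Read the UUIDs back on the worker side."""
    return [uuid.UUID(line) for line in pathlib.Path(path).read_text().splitlines()]

# 500 quanta: well beyond what an 8000-character field holds
# (a 36-character UUID plus separator allows only ~200 entries).
quanta = [uuid.uuid4() for _ in range(500)]
with tempfile.TemporaryDirectory() as d:
    p = f"{d}/quanta.txt"
    write_uuid_file(quanta, p)
    restored = read_uuid_file(p)
```

The round trip is lossless, so cluster membership survives regardless of how many quanta a bps cluster contains.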