Panda Meeting 2023-07-18


Zoom Link

Time

8 am PT

Attendees

Michelle Gower Peter Love Brian Yanny Richard Dubois Jen Adelman-Mccarthy Jhonatan Amado Wei Yang Wen Guan Zhaoyu Yang ...

Regrets

Agenda:

Focus on issues during the weekend of 7/15-16. Several comments/observations:

Could the issue be related to increased load in the Panda system?
Could the issue be related to a networking issue? (Harvester wasn't able to build step 3 jobs / Extra_himem jobs.)
RHEL8 core dumps filled up /var/crash, causing SLURM to drain batch nodes (temporarily addressed).

Notes:

Sunday's issue: couldn't fetch jobs from Panda (fixed after a restart at CERN).
Monday's issue: Wen believes that the issue during the weekend was due to harvester: 1) it fetched jobs from Panda but was not able to prepare them for submission; 2) finished jobs were not staged out. In Rubin's case, these two steps don't actually do anything. One possible cause was the slowness of the shared file system where harvester puts its SQLite DB.
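
One way to check the shared-file-system hypothesis is to time small committed writes to a throwaway SQLite file in each candidate directory. The following is a minimal sketch, not part of harvester: the function name `time_sqlite_commits` and the `probe` table are made up for illustration, and the directories to compare are passed on the command line.

```python
import os
import sqlite3
import sys
import tempfile
import time


def time_sqlite_commits(directory, n_commits=50):
    """Return mean seconds per committed INSERT for a throwaway DB in `directory`."""
    # Create a scratch DB file so the real harvester DB is never touched.
    fd, path = tempfile.mkstemp(suffix=".db", dir=directory)
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE probe (id INTEGER PRIMARY KEY, payload TEXT)")
        conn.commit()
        start = time.perf_counter()
        for i in range(n_commits):
            conn.execute("INSERT INTO probe (payload) VALUES (?)", (f"row-{i}",))
            conn.commit()  # each commit is a separate write transaction on disk
        elapsed = time.perf_counter() - start
        conn.close()
        return elapsed / n_commits
    finally:
        # Remove the scratch DB and any journal/WAL side files.
        for suffix in ("", "-journal", "-wal", "-shm"):
            try:
                os.remove(path + suffix)
            except FileNotFoundError:
                pass


if __name__ == "__main__":
    # Usage: python probe.py <dir1> <dir2> ...
    for directory in sys.argv[1:]:
        print(f"{directory}: {time_sqlite_commits(directory) * 1000:.2f} ms/commit")
```

Running it against the shared file system directory and a local directory gives a rough per-commit latency comparison; it does not reproduce harvester's actual access pattern.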

Question: could the issue be monitored? Probably yes, but not very meaningfully, because understanding it requires deep knowledge of how harvester works.
Mitigation

Short term mitigation: create /opt/harvester_tmp and move the harvester SQLite DB there (note: a DB backup in this case isn't useful unless we are able to restore it within 30 min; otherwise Panda will kill the jobs and resubmit them). Dir created.
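
If the DB file has to be copied while something may still hold it open, Python's `sqlite3.Connection.backup` gives a consistent copy, unlike a plain file copy. This is a hedged sketch only: `copy_sqlite_db`, `demo`, and the `jobs` table are hypothetical names, not harvester's schema or tooling, and pointing harvester at the new location is a separate configuration step not shown here.

```python
import os
import sqlite3
import tempfile


def copy_sqlite_db(src_path, dst_path):
    """Copy a SQLite DB with the online backup API; return the integrity check result."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)  # consistent snapshot even if src is in use
        # "ok" means SQLite found no corruption in the copy.
        return dst.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        dst.close()
        src.close()


def demo():
    """Build a tiny stand-in DB, copy it, and report (integrity, row count)."""
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "src.db")
        dst_dir = os.path.join(tmp, "local")  # stands in for a local directory
        os.makedirs(dst_dir)
        dst_path = os.path.join(dst_dir, "copy.db")

        conn = sqlite3.connect(src_path)
        conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
        conn.executemany(
            "INSERT INTO jobs (status) VALUES (?)",
            [("running",), ("finished",), ("staged",)],
        )
        conn.commit()
        conn.close()

        status = copy_sqlite_db(src_path, dst_path)
        check = sqlite3.connect(dst_path)
        n_rows = check.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
        check.close()
        return status, n_rows


if __name__ == "__main__":
    print(demo())
```

Checking integrity and row counts right after the copy keeps the verification step short, which matters given the 30 min window noted above.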

Long term mitigation: request a dedicated VM with local storage for the DB, and use MySQL instead of SQLite (request submitted to S3DF). The VM will be in the DMZ (no need to use squid and NAT).

Other differences between Step1 and Step3:
- step1 jobs are mostly in pull mode: pilots are re-used frequently, resulting in few pilot submissions.
- step3 jobs are mostly in push mode: each job requires a job submission (by harvester/htcondor), resulting in many more pilot submissions.
S3 for pilot logs is available; still testing.
