
Time

8 am PT

Attendees


Mikolaj Kowalik, Wei Yang, Brian Yanny, Richard Dubois, Jen Adelman-Mccarthy, Edward Karavakis, James Chiang, Michelle Gower, Peter Love, Tim Jenness, Zhaoyu Yang

Regrets

Wen Guan (in WFM review)


Agenda:

  1. CM team news
    1. Summary of issues seen during PDR2 and RC2; a few selected issues are listed below:
    2. Can we use global share (instead of priority) to manage resource contention between different activities?
    3. Some of the Long Running Panda Jobs aren’t doing anything. Why and what should we do about them?
    4. Batch node local scratch space issue.
  2. USDF Panda installation update
  3. Multi-DF test update
  4. AOB

Notes:

  1. CM news:
    1. step1-wide and -deep are done, 2M jobs. Only a few hundred need rescue.
    2. Job submission and execution were not steady, but the jobs ran smoothly.
    3. Long processing tail. Disk I/O and other issues may contribute to this. (Note: this is common in distributed computing.)
    4. Suggested improvement: be able to load millions of jobs into Panda; the recently added throttling capability helps.
    5. Setting priorities for different campaigns helps push both of them through. iDDS (workflow) hasn't implemented priority (does it need it?). JEDI (workload) supports priority. What about global share (to balance different activities)?
    6. Step2 will start today and will be short. Step3 will use lots of RAM. It may also use lots of local scratch space.
      1. Talking to S3DF about setting local scratch as a SLURM trackable resource. This hasn't generated much interest so far.
    7. Question about long-running jobs: do we have that kind of job?
      1. Zhaoyu: most jobs finish within 6h, so we set the Panda timefloor to 6h (after that, the pilot will not ask for a new job).
      2. Tim: we do see jobs running around 24h. That is why Panda job submission sets the walltime limit to 24h.
  2. USDF Installation:
    1. S3 bucket for Postgres DB is ready.
    2. Re-deployment of the iDDS DB was OK; re-deployment of the Panda DB was not.
  3. Multi DF testing
    1. Not doing much but making room for campaigns.
    2. Still, Zhaoyu ran the ci_hsc_gen3 test again at FrDF (11k jobs). Again, no hanging jobs were seen.
    3. Peter: the UKDF (Lancaster) transition from the SGE scheduler to SLURM is months away.
  4. AOB:
    1. Slides of the CHEP summary talk given at the ATLAS weekly by Eddie and Caterina Marcon are attached.