Panda Meeting 2023-04-19

Zoom Link

Time

8 am PT

Attendees

Richard Dubois, Wei Yang, Peter Love, Brian Yanny, James Chiang, Wen Guan, Edward Karavakis, Mikolaj Kowalik, Zhaoyu Yang, Tim Jenness

Regrets

Agenda:

Update on Issues at USDF
1. Job state update with Panda at CERN (via NAT)
2. gs upload issue (via Squid); status of Loki or a log service at SLAC (does this only need S3?)
Running test-med-1 at FrDF
1. borrow ~3000 for a short period
Issues testing test-med-1 at UKDF
Update on Panda server installation at USDF
1. Installation
2. Update on IAM port 8443 issue
AOB

Notes:

Issue updating job state with Panda at CERN
1. Error message: The worker was finished while the job was running : None
2. Going through NAT at USDF, we saw a sudden increase in this error around Mar 7, but it has quieted down recently:
  1. Note: several places could cause this issue: NAT, WAN, Panda server
3. A similar error is seen at UKDF
Latest pilot: some jobs hang in the "starting" state for 2h. Paul Nilsson (pilot developer) is looking into this.
Logging:
1. Real-time (pipetask) logging currently goes to Google. Is Loki at SLAC a solution? Wen is talking to Yee.
2. Pilot logs (log file upload at job end): can use S3.
3. gs upload has been mostly OK recently. Wen will treat it as lower priority than the Panda service installation and iDDS improvements.
Ran step1 on a DC2 subset at USDF and FrDF (and running at UKDF)
1. FrDF (10h, 100 slots, no retry needed). USDF (5.5h, O(1000) slots?)
2. Interested to see the walltime and retry/failure rate when bps clustering is used.
3. A later re-run at FrDF saw the (one and only) merge job fail, even with 3 retries (while Panda was also busy running USDF jobs).
4. Failures seen at UKDF (more than just the error in 1); want to know why, and what to do about the failures.
Panda installation
1. Dev server is running (DB deployed and connected to the Panda server), running a small number of simple test jobs.
2. Wen is asking for a vcluster for the production Panda service.
Improved Harvesters
1. Reduced the number of empty pilots (to 3 per PQ, Panda Queue) for PQs in pull mode.
2. Deployed on the CERN Harvesters a few days ago. Deployed at USDF today.
Second instance of ARC CE at USDF is running
Forgotten item: update on using bps submit instead of prun to generate the q-graph. Did it pass review?
