Zoom Link
Time
8 am PT
Attendees
Richard Dubois, Wei Yang, Peter Love, Brian Yanny, James Chiang, Wen Guan, Edward Karavakis, Mikolaj Kowalik, Zhaoyu Yang, Tim Jenness
Regrets
Agenda:
- Update on Issues at USDF
- Job state update with Panda at CERN (via NAT)
- gs upload issue (via Squid); status of Loki or another log service at SLAC (does this only need S3?)
- Running test-med-1 at FrDF
- borrow ~3000 for a short period
- Issues testing test-med-1 at UKDF
- Update on Panda server installation at USDF
- Installation
- Update on IAM port 8443 issue
- AOB
Notes:
- Issue updating job state with Panda at CERN
- Error message : The worker was finished while the job was running : None
- Going through NAT at USDF: saw a sudden increase in this error around Mar 7, but it has quieted down recently.
- Note: several places could cause this issue: NAT, WAN, Panda server
- Similar error seen at UKDF
- Error message : The worker was finished while the job was running : None
- Latest pilot: some jobs hang in the "starting" state for 2h. Paul Nilsson (pilot developer) is looking into this.
- Logging:
- Real-time (pipetask) logging currently goes to Google. Is Loki at SLAC a solution? Wen is talking to Yee.
- Pilot logs (log file upload at job end): can use S3.
- gs upload has been mostly OK recently. Wen will give it lower priority than the Panda service installation and iDDS improvements.
- Ran step1 on a DC2 subset at USDF and FrDF (and running at UKDF)
- FrDF (10h, 100 slots, no retry needed). USDF (5.5h, O(1000) slots?)
- Interested to see the walltime and retry/failure rate when bps clustering is used.
- In a later re-run at FrDF, the one and only merge job failed, even with 3 retries (while Panda was also busy running USDF jobs).
- Failures seen at UKDF (more than just the error in 1); want to know why, and what to do about the failures.
- Panda installation
- Dev server is running (DB deployed and connected to the Panda server), and is running a small number of simple test jobs.
- Wen is asking for a vcluster for the production Panda service
- Improved Harvesters
- Reduced the number of empty pilots (to 3 per PQ, i.e. Panda Queue) for PQs in pull mode.
- Deployed on the CERN Harvesters a few days ago; deployed at USDF today.
- Second instance of ARC CE at USDF is running
- Forgotten item: update on using bps submit instead of prun to generate the q-graph. Did it pass review?