
Time

8 am PT

Attendees


Richard Dubois, Wei Yang, Peter Love, Brian Yanny, James Chiang, Wen Guan, Edward Karavakis, Mikolaj Kowalik, Zhaoyu Yang, Tim Jenness

Regrets


Agenda:

  1. Update on Issues at USDF
    1. Job state update with Panda at CERN (via NAT)
    2. gs upload issue (via Squid); status of Loki or another log service at SLAC (does this only need S3?)
  2. Running test-med-1 at FrDF
    1. borrow ~3000 for a short period
  3. Issue of testing test-med-1 at UKDF 
  4. Update on Panda server installation at USDF
    1. Installation
    2. Update on IAM port 8443 issue 
  5. AOB


Notes:

  1. Issue with updating job state with Panda at CERN
    1. Error message : The worker was finished while the job was running : None
    2. Going through NAT at USDF, we saw a sudden increase in this error around Mar 7, but it has quieted down recently:
      1. Note: several places could cause this issue: NAT, WAN, or the Panda server
    3. A similar error is seen at UKDF
  2. Latest pilot: some jobs hang in the "starting" state for 2h. Paul Nilsson (pilot developer) is looking into this.
  3. Logging:
    1. Real-time (pipetask) logging currently goes to Google. Is Loki at SLAC a solution? Wen is talking to Yee.
    2. Pilot logs (log file upload at job end): can use S3.
    3. gs upload has been mostly OK recently. Wen will treat it as a lower priority than Panda service installation and iDDS improvement.
  4. Ran step1 on a DC2 subset at USDF and FrDF (and running at UKDF)
    1. FrDF (10h, 100 slots, no retry needed). USDF (5.5h, O(1000) slots?)
    2. Interested to see the walltime and retry/failure rate when bps clustering is used.
    3. A later re-run at FrDF saw the one and only merge job fail, even with 3 retries (while Panda was also busy running USDF jobs).
    4. Seeing failures at UKDF (more than just the error in item 1); want to know why, and what to do about the failures.
  5. Panda installation
    1. Dev server is running (DB deployed and connected to the Panda server), with a small number of simple test jobs running.
    2. Wen is asking for a vcluster for production Panda service
  6. Improved Harvesters
    1. Reduced the number of empty pilots (to 3 per PQ, i.e. Panda Queue) for PQs in pull mode.
    2. Deployed on the CERN Harvesters a few days ago; deployed at USDF today.
  7. Second instance of ARC CE at USDF is running
  8. Forgotten item: update on using bps submit instead of prun to generate the q-graph. Did it pass review?