Attending

Regrets

Agenda

  • Meeting recorder: Sierra
  • Action Items last time: 
  • Announcements:
    • Sierra on leave June 28 to July 5
    • Jen on leave May 27-31
    • Update from Richard on OpenSearch: it looks like OpenSearch has been installed for Rucio. Wei is asking about using it for PanDA (see #dm-rucio-testing).
      • Three related projects: one working with Eddie on ingesting PanDA data, another on ingesting HTCondor data, and a third on displaying the data in Grafana.
    • Calibrations rehearsal week of May 28. Huan, do we need an official "campaign" for your processing for this?
      • LATISS off sky until May 28
  • Some planning:
    • Moved all the Ops Rehearsal 3 campaigns down to Completed. Please finish them: complete the failure reconciliations and checklists, and close out your tickets. The pace is going to really pick up in August and we're not going to be able to keep up if it takes a month to finish a campaign.
    • Watch for news on release candidates for v27.
    • Upcoming one-time campaigns need Pilots:
      • DRP: Template generation for Ops Rehearsal 4  (early June)
      • DRP: HSC RC2 with v27.0.0.rcN for the v27 Characterization report (Whenever Matthias/Jim/K-T announce that all backports have been made. Expecting late May)
      • PP: Ops Rehearsal 4 (June 25-27)
      • DRP: Nightly Validation for Ops Rehearsal 4 (last observation at 6:48am ET) (June 25-27)
    • Monitoring. Both PanDA and HTCondor can get data ingested into Elasticsearch and graphed by Grafana. OpenSearch is a fork of Elasticsearch. Peter Love will install it at USDF and ask Wen for updates. One project is working with Eddie on PanDA data ingestion, another on HTCondor data ingestion, and a third on displaying the data in Grafana.
    • sprint planning on tooling dev
  • Any blockers or inefficiencies to report on the Campaigns?
    • Ops Rehearsal Wrap Up (Homer)
    • USDF PanDA server scalability Wrap Up (Sierra)
      • Chasing down weird quantumGraph generation inconsistency (Colin: "the worst novelty")
    • RC2/DC2 (Jen/Orion/Brian)
      • Waiting for v20 to hit cvmfs and then Jen will run the DC
    • Multisite PDR2-VVDS (Jen/Brian)
    • LATISS Intermittent cumulative DRP (Huan)
    • Daily Calibration Product Production (Huan)
    • LATISS Prompt Processing at the USDF (Hsin-Fang)
    • Monthly ApPipe HSC/DC2 (Erin)
    • Eric and Fritz's HSC Weekly Test (weekly 19)
      • six failures with skycorr that didn't exist on a previous weekly (OOM failures? may be related to these being put into the held state)
        • crashing at 64× the original memory allocation, so these maybe can't just be rerun with more memory (Milano can get up to ~480 GB)
        • skycorr is requesting ~16 GB originally (if default file is being used)
          • has never needed to be bumped up for weekly 18 or previous
      • whatever missing links are in the reporting chain would be good to find and fix
      • pipetask report currently doesn't have a way to say that it is an out-of-memory error
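The skycorr memory numbers above can be checked with a little arithmetic. The Python sketch below assumes, purely for illustration, that each retry doubles the memory request (the notes do not state the actual retry policy). It uses the ~16 GB starting request and the ~480 GB Milano ceiling quoted above; the function name is hypothetical.

```python
def memory_requests(initial_gb, multiplier, node_limit_gb):
    """Return the successive memory requests (in GB) that still fit on a node."""
    requests = []
    request = initial_gb
    while request <= node_limit_gb:
        requests.append(request)
        request *= multiplier
    return requests


INITIAL_GB = 16      # approximate default skycorr request, from the notes
NODE_LIMIT_GB = 480  # approximate Milano ceiling, from the notes
MULTIPLIER = 2       # assumption: each retry doubles the request

fitting = memory_requests(INITIAL_GB, MULTIPLIER, NODE_LIMIT_GB)
crash_point_gb = 64 * INITIAL_GB  # 64x the original allocation

print(fitting)
print(crash_point_gb, crash_point_gb > NODE_LIMIT_GB)
```

Under these assumptions, 64× the original request is about 1024 GB, more than double what a Milano node can offer, which is consistent with the suspicion that simply asking for more memory will not fix these failures.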
  • Sprint Planning
    • Upcoming one-time campaigns need Pilots:
      • DRP: Template generation for Ops Rehearsal 4  (early June) (Orion/Erin) 
      • DRP: HSC RC2 with v27.0.0.rcN for the v27 Characterization report (Whenever Matthias/Jim/K-T announce that all backports have been made. Expecting late May) (Sierra/?)
      • PP: Ops Rehearsal 4 (June 25-27)  (Hsin-Fang)
      • DRP: Nightly Validation for Ops Rehearsal 4 (last observation at 6:48am ET) (June 25-27) (?)
      • May 28 Calibration Rehearsal (Huan). Please add to the Campaigns board if this doesn't fall under the usual 
    • Monitoring. Both PanDA and HTCondor can get data ingested into Elasticsearch and graphed by Grafana. OpenSearch is a fork of Elasticsearch. Peter Love will install it at USDF and ask Wen for updates.
      • A project working with Eddie on PanDA data ingestion.
      • Another project working on HTCondor → OpenSearch.
      • Another project on displaying the data in Grafana.
    • CM-service web interface (backend improvements to support UI progress)
      • last-modified columns needed in some tables to support UI (assignee?)
      • description columns needed in some tables to support UI (assignee?)
    • CM-service command-line interface (cm-client)
      • sort rows in table outputs  (assignee? come with offer to pair code!)
      • wide column / column selection improvements in table outputs (assignee? come with offer to pair code!)
    • CM-service backend
      • found a couple missing package dependencies which need to be added 
      • need some postgres schema support (per user?) to help avoid trampling/confusion during devs-running-multiple-services era
      • server-side daemon
        • will need panda service token
        • include managed condor glide-in "auto" behavior
      • service running in k8s (phalanx probably, Colin offers to pair code)
        • Colin: gets us out of kubectl port-forward mode?  Fritz: yes.
      • need db schema evolution support (probably Alembic)
      • unit tests (always/still...)
    • Our ability to reconcile payload errors (expected vs succeeded) at the stage level (not the workflow/qg level) to generate e.g. PDR2 v24 Error Characterization is still very manual. 
      • Reconciliation at the group level – Needs tests for QuantumProvenanceGraph (Orion)
      • Michelle's "Expected" calculation – Tells how many ____ are expected to be run (clusters? quanta?) (Orion)
      • Group → Campaign level rollup (Orion)
      • Error matching, possibly via QuantumProvenanceGraph (Orion)
    • Ability to separate payload failures from infrastructure failures. 
      •  Accessible error codes from HTCondor out-of-memory jobs
      • DM-44371
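As a rough illustration of the group → campaign rollup discussed above, the Python sketch below sums expected and succeeded counts per stage across groups. The `GroupSummary` class, its fields, the stage names, and the sample numbers are all hypothetical; this is not the cm-service or QuantumProvenanceGraph data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GroupSummary:
    """Hypothetical per-group counts for one stage (not a real cm-service type)."""
    group: str
    stage: str
    expected: int
    succeeded: int


def rollup(groups):
    """Sum expected/succeeded counts per stage over all groups in a campaign."""
    totals = defaultdict(lambda: {"expected": 0, "succeeded": 0})
    for g in groups:
        totals[g.stage]["expected"] += g.expected
        totals[g.stage]["succeeded"] += g.succeeded
    # Add the remainder that still has to be reconciled by hand.
    return {
        stage: {**counts, "unaccounted": counts["expected"] - counts["succeeded"]}
        for stage, counts in totals.items()
    }


# Made-up numbers, for illustration only.
campaign = rollup([
    GroupSummary("group0", "step1", expected=10, succeeded=10),
    GroupSummary("group1", "step1", expected=12, succeeded=9),
    GroupSummary("group1", "step2", expected=40, succeeded=40),
])
print(campaign["step1"])
```

A real version would also need the error-matching step, so that the "unaccounted" remainder can be split into payload failures and infrastructure failures.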



  •  
    • UI work.
    • Jen & Fritz will schedule another meeting 
    • https://jira.lsstcorp.org/secure/RapidBoard.jspa?rapidView=259
    • Unsorted list:
      • Use a couple of edge cases of infrastructure errors we failed to retry in the fall RC2s. We can use these as how-to-reproduce cases to ensure that we can identify them with our bps report/pipetask report tools.
      • Analyze the scalability test: Report payload errors (file tickets), correlate infrastructure errors with data facility telemetry
      • Increase test coverage, incl. learning how to use. Maybe party-review :)
      • cm-service decision to retry based on info from new error reporting tools. 
      • Decide on which flavor of retry (e.g. leave empty or bad collections that don't end up in the chained collection) (can discuss 2/21); see DM-41617 (CM service, rollback and delete functionality).
      • add pipetask report to the default scripts run by bps submit. 
      • Improvements required as a result of most recent RC2 run with cm-service.
        • Eric thinks cm-service will be ready for others to try the micro campaign after that. 
        • And then we should have another conversation about the specification files. 
      • Touch base with Ash and Gregory Poole. 
      • pipetask report command-line tool MERGED! (DM-41606) Thank you Ken!!
        • On the command line: pipetask report REPO QGRAPH
        • pipetask report --help gives good directions
        • This can be used for any run that has an associated qgraph! I.e., runs in CM, pipeline tasks, resource usage, etc.!
      • extension to aggregate multiple qgraphs (DM-41711)
        • Have added explicit reporting on "recovered" quanta.
      • Figure out what's left to allow us to transition from functionality to usability
    • Eric had a smooth RC2 run (aside from some issues with faro_matched and daemon restarts due to PanDA tokens / power outage)
      • PanDA team is checking to see if the token can be longer than 7 days.
      • Will ultimately need a service account
    • Fritz has been using cm_service to run campaigns and it appears to be working now!
    • BPS not working with HTCondor/Parsl in w14 due to a merge for multisite w/PanDA. Fix is in place for the next weekly.
    • Qgraph summary file MERGED 4/16 ! ticket DM-41542, thank you Michelle!
    • Eric: for CM service to use HTCondor plugin, need to have the problem of too many large Qgraph generations running on the dev nodes sorted out. Expect to need to have it done within a job or service on a compute node.
    • Michelle trying to get Greg or Nick to tweak some API parameters to enable allocateNodes and such to be run by bps, but still a little ways off (some weeks, if not longer).
  •  Show and tell:
  • AOB