
  • Meeting recorder: Fritz (promised last time!)
  • Action Items last time: 
  • Announcements:
    • Yusra is in Chile for the next couple of weeks as part of the DM volunteer support program, which helps share information between the Chilean and non-Chilean parts of the project.  If this interests you, consider signing up at ls.st/summit (correct link?)  Orion is going down next!
    • Ops rehearsal is next week. We are responsible for:
      • The nightly validation: $DRP_PIPE_DIR/pipelines/LSSTComCamSim/nightly-validation-ops-rehearsal-3.yaml#step2b,step2c,step2d,step2e,step3,step4,step5,step7 starting at 3am PT, to be done at 6am PT, with only exposures from the night before.

      • Intermittent DRPs: once, starting Friday morning (though could be convinced to do it halfway through).
      • Still looking for a pilot for daily post-observation pipeline runs...  Should run as one quantum graph, selected on dayobs.  Done manually?  Or set up a cron job?  Would be good to experiment with automation now...  Observations each day concluding at 3AM project time (6AM eastern).  Needs to be done Wed/Thur/Fri.
      • Huan: What's the scale? It would be good to try this out beforehand.
        • About 1/4 of the precursor runs: a couple hundred visits x 9 sensors, so ~1800 exposures. Will probably take about a day to run the whole thing. Tens of thousands of quanta.
        • We should do a precursor run with the same configuration as we expect for Wednesday.  Plan to do this Mon/Tue?
    • Summary of monitoring meeting from Friday: I'll show more during show and tell, but Eddie's answer to how ATLAS does monitoring is "We ingest into Elasticsearch, and query with these dashboards." Wei will talk to Tim Noble about getting an instance set up at USDF. Elasticsearch is what HTCondor natively ingests into, so we can handle both of our WMSs with one db.
    • Travel:
      • Orion to Chile 6-20th (traveling 6-7, 19-20).
      • Fritz to TX for eclipse April 7-9, and Chile April 17-28.
      • Jen away from April 3-10.  
      • Jen also away May 27-31
      • Ken will be swamped! April 9-10.
      • Colin away April 5-12.
      • Huan away 5th through 9th.
      • Erin away 1st through 10th.
      • Hsin-Fang: possibly out the next 3 Fridays
  • Any blockers or inefficiencies to report on the Campaigns?
    • Ops Rehearsal Preparation: DRP on simulated ComCam (Homer/Erin)
    • RC2/DC2 (Jen/Orion)
      • Brian will take over while Jen is away.
      • Should we bump the RC by a week due to everyone's vacation schedule?
      • The previous weekly had a problem with retries, caused by an unexpected middleware change. A fix should be in this week's weekly.
        • Jim merged it last night, so it should be in the stack.
        • The fix correctly skips quanta that were already completed, so it doesn't unnecessarily re-run the parts that succeeded.
    • Multisite PDR2-VVDS (Jen/Brian)
      • Still not quite ready for the next step. Jen has access to the rucio commands.
        • We should schedule a demo for the rucio transfers, after everyone's vacations.
    • USDF PanDA server scalability.  (Sierra)
      • Blocked on PanDA not making errors available. Pilot logs are missing, AND there are infrastructure errors where there isn't even a log file in the Google bucket.
      • Not directly an impediment to a time-critical operational need, but necessary for characterizing PanDA/infrastructure at USDF.
    • LATISS Intermittent cumulative DRP (Huan)
    • Daily Calibration Product Production (Huan)
    • LATISS/ComCam Prompt Processing at the USDF (Hsin Fang)
      • Normal LATISS PP operations except last night there were infrastructure issues: a k8s node went offline with pods stuck in terminating, affecting multiple services for LATISS. 
      • ComCamSim testing might have broken the record for the longest thread on Slack.
      • Where to discuss the things that the "Campaign Committee" should decide?
        • What decision: Which stack? setup and tradeoff, etc
        • So far, Eric, Colin, Yusra, Hsin-Fang, etc. make ops decisions in the Slack channels #auxtel-prompt-processing and #comcam-prompt-processing.
        • Prompt Processing group should make the call, if Eric is good with it. Continue the usual way and not do it differently for ops rehearsal.
        • Will have similar questions about software versions for DRP. Yusra.
    • Monthly ApPipe HSC/DC2 (Erin)
      • Not yet quantified, but quantum graph generation is taking a lot longer than it used to.
      • Report to Middleware; they're interested in knowing about anything that's out of the ordinary.
      • Huan: had a graph build on Monday this week that took a long time, but it was faster on retrying later that same day. Could be some sort of database issue?
      • Jen: there are some hints one can give that speed up graph building.
    • Orion:
      • w_2023_39 lost logs. Pipetask report found the dataIds that had missing logs. ForcePhotCCD took forever to run. Ran with HTCondor. Will need Condor help; has been running for a week ???!
      • Hard to tell what's going on when something is taking forever.
      • Michelle can tell Orion how to look at what the HTCondor job is doing, look at what the logs are doing.
        • +1 on "real time logging"
      • Side note: Greg looking into cgroups for enforcing limits at condor layer. Preliminary report: infrastructure has an older version of cgroups than condor needs.
  • Tooling.
    • News:
      • Orion fixed a bug in pipetask report. Permissions errors on saving to a directory that you can't write to. Sierra wrote a test!
    • https://jira.lsstcorp.org/secure/RapidBoard.jspa?rapidView=259
    • Unsorted list:
      • Use a couple of edge cases of infrastructure errors we failed to retry in the fall RC2s. We can use these as how-to-reproduce cases to ensure that we can identify them with our bps report / pipetask report tools.
      • Analyze the scalability test: Report payload errors (file tickets), correlate infrastructure errors with data facility telemetry
      • Increase test coverage, incl learning how to use. Maybe party-review (smile)
      • cm-service decision to retry based on info from new error reporting tools. 
      • Decide on which flavor of retry (e.g. leave empty or bad collections that don't end up in the chained collection) (can discuss 2/21); see DM-41617, "CM service, rollback and delete functionality" (TO DO).
      • add pipetask report to the default scripts run by bps submit. 
      • Improvements required as a result of most recent RC2 run with cm-service.
        • Eric thinks cm-service will be ready for others to try the micro campaign after that. 
        • And then we should have another conversation about the specification files. 
      • Touch base with Ash and Gregory Poole. 
      • pipetask report command-line tool MERGED! (DM-41606) Thank you Ken!!
        • in command-line: pipetask report REPO QGRAPH
        • pipetask report --help gives good directions
        • This can be used for any run that has an associated qgraph! Ie, runs in CM, pipeline tasks, resource usage, etc!
      • extension to aggregate multiple qgraphs (DM-41711)
        • Deciding on format for report on Datasets – feedback and opinions welcome/requested!
        • Have added explicit reporting on "recovered" quanta.
      • Figure out what's left to allow us to transition from functionality to usability
        • Orion and Sierra to meet with Eric tomorrow (2/23, 1pm) about setting up cm-service for future campaigns.
        • Fritz to do a campaign with cm-service.
    • Eric had a smooth RC2 run (aside from some issues with faro_matched and daemon restarts due to PanDA tokens / power outage)
      • PanDA team is checking to see if the token can be longer than 7 days.
      • Will ultimately need a service account
    • Fritz has been using cm_service to run campaigns and it appears to be working now!
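
For the daily post-observation pipeline runs discussed under the ops rehearsal announcements (one quantum graph per night, selected on dayobs, possibly from a cron job), a minimal sketch of the selection step might look like the following. It is only an illustration: the helper names `day_obs_for` and `nightly_data_query` are hypothetical, the 12:00 UTC observing-day rollover is an assumption, and the `exposure.day_obs` field in the query string is assumed to exist in the repository's dimension universe and should be checked before use.

```python
from datetime import datetime, timedelta, timezone


def day_obs_for(moment: datetime) -> int:
    """Return the observing day as an integer YYYYMMDD.

    Assumption: the observing day rolls over at 12:00 UTC, so a whole
    night of observing falls under a single day_obs value.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    shifted = moment.astimezone(timezone.utc) - timedelta(hours=12)
    return int(shifted.strftime("%Y%m%d"))


def nightly_data_query(instrument: str, day_obs: int) -> str:
    """Build a data query string restricted to one instrument and one night.

    Assumption: the repository exposes day_obs as a field on exposure.
    """
    return f"instrument='{instrument}' AND exposure.day_obs={day_obs}"


if __name__ == "__main__":
    # A job started after the night ends, but before 12:00 UTC, still sees
    # the day_obs of the night that just finished.
    now = datetime.now(timezone.utc)
    print(nightly_data_query("LSSTComCamSim", day_obs_for(now)))
```

Whether this is run by hand or from a cron job, the resulting string would be passed as the data query when building the night's quantum graph; the scheduling itself is left out of the sketch.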
