Logistics

Date 

2022 February 15-17

Location

This meeting will be virtual.

Join Zoom Meeting

Meeting: https://noirlab-edu.zoom.us/j/93368173995?pwd=Ni93Q1hRZVJ2WjJncHlNbTVoTFVQdz09

Meeting ID: 933 6817 3995 

Password: 369811


Attendees:


Day 1, Tuesday Feb 15 2022

Time (Project) | Topic | Coordinator | Pre-meeting notes | Running notes

Moderator: Yusra AlSayyad 

Notetaker: Ian Sullivan 

09:00 Welcome - Project news and updates
  • Project updates
    • Few updates since Victor could not make the project meeting
    • Reports are that Omicron has peaked in Chile
    • Not enough vehicles available on the summit for workers staying late.
09:30 Missing Functionality

Chuck's list: Missing Capabilities needed by Commissioning

Missing Functionality as tickets (Jira)

  • All the missing functionality is loaded in Jira and can be sorted by priority.
    • FAFF: First-Look Analysis and Feedback Functionality 
    • FE: worry that we have identified a bunch of missing scope. DM is technically well placed to assist with these efforts, but everyone is already fully loaded with jobs. What is the exposure for the T/Cams to this list?
    • WOM: There is effort required, we have to support this. There is budget for dealing with this, and we may have to push some work out.
    • RHL: We have to coordinate on the commissioning side, and not ask T/Cams prematurely. Commissioning will have to negotiate with the construction project.
    • KTL: There is not a well-defined interface between DM and Sit-com, which makes coordinating effort hard. This is a management problem to solve.
    • FE: Only two things we can give up: either scope from construction (Victor has ruled this out) or give up schedule. We are already working full-out on construction and operations milestones.
    • RHL: we're looking at what tools are missing to even be able to run AuxTel.
      • We have the opportunity to realign the effort from early Ops
      • FE: We publicly committed to supporting Data Previews, which ties our hands for switching to supporting commissioning
      • GPDF: agrees there is tension between supporting the early Data Previews program and supporting commissioning
      • WOM: there should not be a DP0.2
      • YA: The solution is to get Bob to recognize these efforts as pre-ops, and fund it that way
      • GPDF: What happens next?
        • WOM: The first ticket is to figure that out; WOM expects to chase the other tickets down and see if we can free up any resources in DM
          • There is funding for all this work, just not free people
        • KTL: Will there be DM representation in FAFF2?
          • FE: SK was doing this in his science time. We need to see the charge, but then need at least 2 people to attend.
          • RHL attends these meetings
10:00 QA - Quality of QA additions to lsst_distrib, e.g. inability to run at IN2P3
  • Faro code required a lot of memory - possibly should not have been added to lsst_distrib
    • Cannot be run at IN2P3
    • JB: They made a reasonable decision to get something up and running quickly. It was not recognized that this was a Minimally Viable System, and not production ready
    • TJ: On reviewing, they are following procedures. But, none of them are pipelines developers and they are doing reviews in a closed group.
    • WOM: Was there an RFC to add to lsst_distrib?
      • TJ: Yes, but the problem is that after the RFC is approved new algorithms can be added and can blow up the memory requirements
      • RHL: When this became clear it was a problem, we took it to a Faro meeting and worked out some fixes. The larger problem is how slow it was to provide that problem report
        • CS: Partially, but there is more than one issue
        • YA: If Faro was part of pipelines, as soon as Brock reported that a component could not be run because of memory, someone would have been on it immediately. JB and YA would typically make the call whether to drop an algorithm from the monthly DRP reprocessing
    • KT: The problem of cross-team reviews is endemic. Do we make a policy that 1/N reviews should go to someone external?
    • JB: For decision making, YA and JB can make decisions based on the scientific algorithms, but don't have insight into the cost of computing. We need to have someone on the other side pushing back to say that running X is too expensive
    • CS: We have to think about the performance of the pipelines in every aspect now. That this was a problem for RC2 was a signal that this is a problem. Discovering problems is a way to get action taken.
      • WOM: Should there be a processing budget?
      • TJ: We record quantum execution time and peak memory usage.
        • We have different nodes with different memory limits, some are pre-emptable and some are not
      • EB: AP has been watching metrics for speed and memory use for some time
  • WOM: asks CS if the QA group can ensure that they ask for reviews outside the QA group.
    • CS: We can make sure that we are not inventing arbitrary barriers to integration
    • WOM: Need to emphasize that reviews are not nit-picking them, but are trying to make the code better
    • CS: not so much a reluctance to take the review feedback, more a reluctance to take up DM experts' time to assist making reviews and making the improvements.
    • YA: No disagreements, but we should postpone until Leanne is available.
10:00 Build system - Do we want any changes as we reimplement the build system at SLAC?  How can we improve reliability and maintainability?  What outputs are needed that we don't currently have (e.g. images from CI runs)?  Can we simplify and combine any existing processes (e.g. separate lsst_distrib, ci_hsc, ci_imsim, nightly-release runs)?

Need to stand up a brand-new instance of Jenkins at SLAC

  • Current Jenkins runs on AWS, is very out of date
  • FE: we should use cloud Macs, not run our own machines in the future
    • Cloud Macs were not used before, because they didn't use to be available
  • KT: It would be possible to keep Jenkins in the cloud, but don't know the cost
    • FE: We really need a build engineer to make these decisions
      • In the past it was expensive to run Jenkins, because it became a workflow system.
      • KT: So stand up Jenkins on the cloud, have it trigger CI running at SLAC
    • FM: The USDF staffing plan has a build engineer allocated (we just have to hire them)
    • JB: Agree with FE that we need to get the CI tests out of Jenkins
      • The big limitation is that we need to have a build and install system that runs the exact build from Jenkins with the CI at SLAC
      • KTL: Doing the builds and the small tests on the cloud is fine.
      • TJ: We should consider moving away from test data repositories, but instead get the correct data from the butler
      • KTL: Storing and transferring data in the cloud is expensive
  • Google Cloud
    • FE: SQuaRE is leaning towards GitHub Packages. KTL: enterprise size only seems to go up to ~ gigabytes, doesn't seem sufficient
  • NCSA workers and SQuaSH moving to SLAC
    • FE: SQuaSH should not be an issue, it's moving to the Science Platform language. SQuaSH will use the EFD
  • Transition: 
    • Build in both places for a while, but only publish from one? Or publish from both to different places?
      • FE: SQuaSH has been using Jenkins job IDs, but there is no such concept in the new system. If we can make sure we have unique identifiers with the runs we can easily track history
      • JB: Currently writing a Butler technote on tracking history
  • Simplifications
    • Many layers to the current build system
      • CS: GitHub Actions are easy to understand; the current scripts are hard to understand. That makes them harder to maintain
  • Additions
    • Outputs from CI runs need to go into a Butler repo? Perhaps we can use code from the OODS to clean up old RUN collections.
      • JB: Would be very helpful to be able to run tests (like unit tests) that are based on the results from the last processing run
    • IS: Would be very helpful to have nightly build artifacts going back a couple weeks, and weekly build artifacts available for the past few months to a year. If this was accessible through the butler, that would be easy.
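The artifact retention request above (nightly builds kept for a couple of weeks, weekly builds kept for months to a year) could be expressed as a small policy function. This is only a sketch: the function name, the build-kind strings, and the exact window lengths are assumptions chosen for illustration, not an agreed policy or an existing API.

```python
from datetime import date, timedelta

# Assumed retention windows. The notes only say "a couple weeks" and
# "a few months to a year", so these exact values are placeholders.
NIGHTLY_KEEP = timedelta(weeks=2)
WEEKLY_KEEP = timedelta(days=365)


def keep_artifact(kind: str, built_on: date, today: date) -> bool:
    """Return True if a build artifact is still inside its retention window."""
    age = today - built_on
    if kind == "nightly":
        return age <= NIGHTLY_KEEP
    if kind == "weekly":
        return age <= WEEKLY_KEEP
    # Unknown build kinds are kept so nothing is deleted by accident.
    return True
```

A cleanup job would call `keep_artifact` for each stored build and remove only those for which it returns False.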
10:30 Break

Moderator: Wil O'Mullane 

Notetaker: Kian-Tat Lim 

11:00 Big files and little files and little pieces of big files - slides to kick off discussion

  • Most files are under 0.5GB, a few up to 7-ish.
  • Dominated by things under 100KB.
  • In some cases, multiple small files are already aggregated into larger ones by PipelineTasks.
  • Can swap underlying storage in Butler, but don't know what to change to.
  • Users want image cutouts and catalog column subsets; this will be less performant from object stores unless something like Arrow can be used to do partial reads (cfitsio cannot).
  • Hard to avoid writing small files; could upload to SQL databases, then export into larger files.  Or use non-SQL database.
  • Need to sync at all sites.
  • Metrics are many of the small files; they could go directly into a database as they're not desirable as small files in the long run.
  • TJ: Butler composite disassembly could make even more small files in order to optimize retrieval of pieces of complex datasets.
  • TJ: HDF5 has a remote server option to support cutouts.
  • Caches might help.
  • Could move away from FITS.
  • Write our own "read FITS subimage" routine? Harder with compression.
  • Maybe put byte offsets into the Butler database to more efficiently retrieve pieces?
  • TJ: Someone on astropy mailing list was working on a FITS file index.
  • CS: Sequential vs. random access costs can be surprising; don't always get gains you think you would.  Need to measure empirically.
  • SRP: Has any profiling been done?
    • JB: Only for Parquet files; not for FITS.
  • KT: Some data will be loaded into a database even if it was in files before. There are infrastructure solutions that might help, e.g. a Weka file system providing caching and POSIX semantics in front of an object store; it can also package lots of small files into one object. This could be made transparent to pipelines, but other sites may not adopt it, so it may work only at SLAC. Could package some small files in the Butler using TAR. WOM: thought this was HDF.
  • YA: Opportunities to benchmark soon.  Spark can likely already access Parquet efficiently (if we can make it accessible to frameworks like that); we'd need to make the Butler do this.
  • FE and GPDF: Will need to offer a service that returns pointers to Parquet files. Would be great to also enable code to retrieve parts of such a file.  Firefly can deal with Parquet.  TAP is bad for large results.
  • JB: Parquet mostly solved, need to use it; FITS is a potentially looming problem, need to benchmark.
  • JB: Should Pipelines continue aggregating files in PipelineTasks?
    • desirable and reasonable to aggregate files and metrics in parquet for now
    • no sense putting in DB unless you want it in a DB for query - not if you are going to dump
    • if we want to query we SHOULD put in butler DB, campaign management shows there will also be data outside butler needed for graph building, butler needs to join externally provided lists of data ids.
  • JB: Worried that the access patterns for many small files don't align with write patterns; transpose needs to occur.
  • RHL: Logically want separate databases even if they live in the same Postgres instance.
  • SRP: Need to tune filesystem properly, especially for small files.
  • CS: Mitigating performance limitations may still leave an access issue that needs to be dealt with.
  • TJ: Need to query on metrics
    • But doesn't mean it has to be in Butler
      • Kian-Tat Lim Convene a meeting with Colin, Tim, Robert, Yusra to resolve graph generation with per-dataset quantities (likely based on Consolidated DB work).  
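The suggestion above of keeping byte offsets in a database, so that one piece of a larger file can be fetched without reading the whole thing, can be illustrated with plain byte strings. This is a hedged sketch only: `pack`, `read_piece`, and the in-memory index are hypothetical helpers, not Butler APIs, and it ignores the FITS compression complication raised in the discussion.

```python
import io


def pack(pieces):
    """Concatenate named byte strings; return the blob and a name -> (offset, length) index."""
    blob = bytearray()
    index = {}
    for name, data in pieces.items():
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


def read_piece(stream, index, name):
    """Seek straight to one piece using its stored offset instead of reading the whole blob."""
    offset, length = index[name]
    stream.seek(offset)
    return stream.read(length)


# Three small "files" aggregated into one object, plus the index that
# would live in a database alongside the dataset record.
blob, index = pack({"metric_a": b"1.25", "metric_b": b"0.07", "cutout": b"\x00" * 16})
piece = read_piece(io.BytesIO(blob), index, "metric_b")
```

Against an object store the `seek` and `read` pair would become a ranged read, which is where the empirical measurement CS asks for would matter.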
11:20 Little tasks vs big tasks


Brought up in the morning session and relevant to previous topic. News that e.g. preemptable queues cost 4X non-preemptable queues affects the tradeoffs in pipelines design.
  • Longer-running tasks are more costly than shorter-running because they can't use preemptible nodes
    • But shorter jobs have more overhead
  • FE: Cloud could be a good environment to do optimization.
    • WOM: Would prefer to do on USDF rather than IDF.  Jim Chiang may be of assistance as well.
  • JB: Want some kind of change control process that allows communication of performance/cost tradeoffs.
    • DP0 has a "what goes into the pipeline" committee
  • CS: Changed size scales dramatically for DP0.2, brought new issues to the forefront.
  • YA: Want to understand the costs and constraints up front (e.g. of DFs).
  • FE: Who is the performance guru?  Early on, SQuaRE was going to do some of this.  CI was going to calculate key performance metrics as part of characterization reports for releases, but we were never able to do them in a realistic batch system.  If we run CI processing outside Jenkins, we can gather useful metrics.
  • JB: Need to validate performance of production.  Need "product ownership" of computational performance to say "you can't have that".
    • WOM: Possibly will get push back from USDF when we start running there.
    • YA: Reviewer suggested making a budget for each DR and staying within it.
  • CS: IDF has too much scalability and CPU-to-memory ratio flexibility.
  • FM: USDF will have batch processing hardware in the next few months.  But we don't have a task to tune the pipelines against the node types at USDF.
    • RHL: What feedback is needed?
    • FM: E.g. how much memory is actually needed and how do we make it fit.
    • RHL: Large processings during Commissioning at sizes that will stress the systems should provide this.
  • YA: Reports coming out of processings with CPU-hours and memory usage are already there.  Need to think about hardware-level optimizations.
  • SRP: Run at IDF with nodes similar to USDF.
    • MB: But there are other things that can't be emulated directly.
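The per-DR budget idea, combined with the per-quantum execution time and peak memory that are already recorded, could be checked along the following lines. The record layout, task names, and numbers are invented for illustration; this is a sketch of the bookkeeping, not of any existing reporting tool.

```python
from collections import defaultdict

# Hypothetical per-quantum records: (task name, CPU-hours, peak memory in GB).
quanta = [
    ("isr", 0.02, 2.0),
    ("isr", 0.03, 2.1),
    ("coadd", 1.50, 14.0),
    ("metrics", 0.40, 40.0),
]


def summarize(records):
    """Total CPU-hours and worst-case peak memory per task."""
    cpu = defaultdict(float)
    mem = defaultdict(float)
    for task, cpu_hours, peak_gb in records:
        cpu[task] += cpu_hours
        mem[task] = max(mem[task], peak_gb)
    return dict(cpu), dict(mem)


def over_budget(records, cpu_budget_hours, node_memory_gb):
    """Return (total CPU exceeds budget?, tasks whose peak memory exceeds one node)."""
    cpu, mem = summarize(records)
    too_big = sorted(task for task, gb in mem.items() if gb > node_memory_gb)
    return sum(cpu.values()) > cpu_budget_hours, too_big


cpu_exceeded, memory_offenders = over_budget(
    quanta, cpu_budget_hours=5.0, node_memory_gb=32.0
)
```

With these made-up numbers the CPU total stays inside the budget while the `metrics` task does not fit on a 32 GB node, which is the kind of signal a "you can't have that" owner would act on.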
11:45 Image services (esp. for DP0.2) - Gregory Dubois-Felsmann / Frossie Economou

Define the remaining missing pieces needed to deliver acceptable image services for DP0.2 - and beyond - and ensure that they are all assigned to specific groups/T-CAMs

Slides for discussion

(See also slides from June 2021 DMLT)

  • Want to provide image services that are useful for Commissioning (rapid turnaround) and thus AP as well as static DRs.
  • Required:
    • ObsCore data model — prototype never deployed for generating this; "assembly line" to go from pipeline outputs to data and into ObsTAP below not done
    • ObsTAP (ObsCore in TAP) — prototyped with WISE data, but not integrated with Rubin data
    • Image retrieval via HTTPS linked to ObsTAP — not done
    • SODA cutout service — in work and demonstrated, being refined
    • DataLink annotations to ObsTAP for SODA etc. — basic capability developed for CADC TAP
    • HiPS service — algorithmic work done but blocked by middleware to make it a PipelineTask and not integrated; also do not have code to generate needed metadata and organized tree
  • JB: With regard to new architecture for "calculational image services", having Butler be one interface for all I/O may make it too complex
  • FM: Meeting latency requirements is a concern.  How is that being addressed?
    • Russ is working on an eventual solution with a persistent process with a pool of servers.
    • The calculations are done with a library, not a PipelineTask.
  • FE: #1 problem is the "superhighway" from outputs to ObsTAP.  No one currently owns Felis.
    • TJ: GPDF is the data engineer in Ops. SLAC has one (unhired) person in the plan to support that.
  • WOM: Exposing a DB view could be a performance issue with thousands of users. GPDF: It's read-only, and the view can be materialized for the public release catalogs
  • geometry from 
    • FE: Big change over baseline. The consolidated DB is not firm and we do not have a low-latency source of truth for image metadata, so this would be good. Also we are running out of time, so it is worth trying. Currently 4 ways to represent
    • RSP is used on the summit but has no way of layering services in front to, say, browse data in real time; Gregory's approach would allow this.
    • RHL: Need a real database on the summit, e.g. for camera trending (not just RSP butler; notetaker not sure what he means). Partly afraid this will lead to pushing more into the Butler.
12:30 Break

Moderator: Leanne Guy 

Notetaker: Frossie Economou 

Please note polls for June and Oct meetings

June: https://doodle.com/poll/ggwsy4hvm7y62ywc?utm_source=poll&utm_medium=link

October: https://doodle.com/poll/cd37rsi7ghbnx2ce?utm_source=poll&utm_medium=link

1:00

Status of Prompt Processing

Kian-Tat Lim (requested by Eric Bellm )

DMTN-219

https://dmtn-219.lsst.io

  • (prototype on Google cloud off-slide - uses Cloud Run)
  • RHL: What is that 1 parameter? KTL: the number of visits to upload. 
  • RHL: Does this bypass the OODS? KTL: Yes, and OODS also gets a copy, and may be triggered by writing to a different object store on the summit rather than one at the USDF.
  • WOM: Auxtel too? KTL: Yes
  • SRP: How do you know when all the snaps have arrived? KTL: Because I am told how many snaps will arrive in the next event. I wait for that number.
  • KTL: This design is with nightly release in mind. If you need quicker turnaround, service can be in a container with stack mounted elsewhere. 
  • KTL: OK, the prototype is done. Shouldn't be too much work to get it working at the summit, modulo finding someone to work on it. Maybe someone in AP?
  • Ian/Eric: We weren't expecting to write anything other than the payload,  but are not surprised that there's no one else. 
  • ECB: KTL has given us a design. What's needed for us to accept it and start building it?
  • CTS: Someone actually needs to use it. 
  • ECB: Does anyone want to comment on the technote before we do that?
  • WOM: I'd like to investigate the WBS and see where this was intended to be. The correct place might be at the USDF, but we don't have anything.
  • KTL: I just need someone to set up a butler with some calibrations. 
  • FM: There is some other soggy ground of stuff falling between the APDB and Alert pipeline. Do you need an APDB?
  • ECB: Yep. FM: Andy can help with that.
  • WOM: Where do the metrics go?
  • KTL: They get put into the Butler, and then there's a separate task that will publish them via Kafka. In this case the metrics would go to the summit and go back to the EFD. This is something that I negotiated with Tiago.
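KTL's answer on snaps (the next-visit event announces how many snaps to expect, and the service waits for that number) amounts to a simple completion counter. The class and method names below are hypothetical and are not taken from DMTN-219; the sketch only shows the counting logic, not the upload notifications or timeouts a real service would need.

```python
class VisitCollector:
    """Collect snap arrivals for one visit until the announced count is reached."""

    def __init__(self, expected_snaps: int):
        if expected_snaps < 1:
            raise ValueError("expected_snaps must be at least 1")
        self.expected_snaps = expected_snaps
        self.arrived = set()

    def on_snap(self, snap_id: int) -> bool:
        """Record one arrival; return True once every expected snap is present."""
        # Using a set makes duplicate notifications for the same snap harmless.
        self.arrived.add(snap_id)
        return len(self.arrived) >= self.expected_snaps


collector = VisitCollector(expected_snaps=2)
first = collector.on_snap(0)      # one of two snaps has arrived
duplicate = collector.on_snap(0)  # repeated notification, still incomplete
second = collector.on_snap(1)     # both snaps present, processing can start
```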
1:30 Build system

Moved to 10:00

14:00 Close

Day 2, Wednesday Feb 16 2022

Moderator: Leanne Guy 

 Notetaker: @womullan






9:00 USDF Deployment

pdf - Richard


if there is news to share from Richard Dubois on hardware deployment at the USDF; otherwise defer to a future DMLT

... maybe this is just a readthrough of RTN-021

Slides cover details of orgcharts and planning for USDF. Ref RTN-021 and DMTN-189.  Assumptions slide quite good.

SDF currently shows 7-10K cores available (of 15K) over the last week.

RHL: Is the HSC data RC2? Yes; Brandon will bring it, Jim identified it.

EB: What's the distinction between Batch and K8s? RD: It is a palette of tools; we are trying not to tie it down. LLS will have a petaflop and much of the time will only use 30% of it. We can interface to Condor, etc. We can have a priority set of machines for Rubin, then spill out to others.

RHL: Support of LATISS? It is already taking data every month. Is Prompt being set up for it?  RD: Had not heard of it.

KT: Anticipating Prompt would be at USDF on K8s; whether it's operational and routine may take until Oct, but there should be something. For Eric: devs will typically log in to a similar dev node at USDF. RD: Open to what's best for dev.

CS: Login node is just authentication; no one cares. We care about the compute node. RD: SDF login nodes are beefier and could be used as development nodes. SP: NCSA login nodes are only a jump-off to dev nodes, but the HPC nodes are submission nodes to batch. RD: Others have had dedicated developer nodes; we can optimise together.

9:30 PanDA evaluation/selection

pdf - Richard


I've heard slightly conflicting reports about whether PanDA is definitely what we'll use for batch processing vs. something we're still evaluating (and if the latter, who is deciding, and when).  I have some guesses about what might make us prefer it over HTCondor (or Parsl?), but they really are just guesses.  I'd like to understand what the roadmap is here, both for long term use in operations and what our developers will be using after the move from NCSA to SLAC.

Path to processing - this is what Fritz is doing now.

CS: Org chart: who is responsible for nightly processing on there? RD: Execution team?

And when it goes wrong? If Alerts fall over, they should know and call the correct person.

Not intending to run shifts at SLAC - but we are hoping to use European effort for out of hours - they would need training.

RHL: Feb looked very interesting (Fritz end of Feb is almost April), so when is the first RC2? FM: Cobbling some bits together with JimC, using DC2 data already at SLAC, so end of Feb indeed, but it would not be the regular processing.  DP0.2 people are shaking out a lot of PanDA issues and we want to take advantage of that. HSFC will also be back in April, so perhaps about then. Prototype in Feb, make robust in March; could have something in April (no promise).

RHL: Campaign management? RD: Is staffing up now, so come back in May. We are nowhere on this.

KTL: Rucio: we do want to start execution across the 3 DFs at some point, so we need Rucio, but not in the initial runs (which would be like NCSA). What did Colin originally expect for execution? DRP and AP are the two big components, often concurrent. RD: Execution has 8-10 people in ops, but not now.

FE: Several in-kinds offered out of hours; Japanese perhaps. RD: Have been trying to ping Phil to see what kind of in-kinds we can get.

YA: Colin, are you worried that keeping all the processing running belongs elsewhere, or only that it's too small? CS: There are boxes for responsibilities; would like to see Alerts and DRP separate. YA: But it's one group now.

YA: Thought there were shifts for exec. RD: We are hoping to cover it with Europeans etc. WOM: notes he set up some of this and there are identified Alert and DRP people in Exec. Will talk with HFC when she returns on how this is organised exactly.

JB: Adopted PanDA since we wanted to orchestrate across many sites and it's the only game, but Parsl and Pegasus are more DAG oriented. KTL: Looking at ways to build graphs and submit to sites and push up to campaign management, but for now still need PanDA features.

LG: Any plans to test Parsl, like a DP0.2 test?  RD: Not aware of any. DESC are doing their reprocessing of DC2 with Parsl. Also CC-IN2P3 are doing DC2 with Parsl. Parsl has not done large-scale processing; multisite is weak and understanding what's going on is weak.  TJ: Keeping BPS general. Condor is tested; PanDA is tested.  SP: NCSA people meet with HTCondor monthly; they are set to help out.  Wei: HTCondor and PanDA are not the same level of thing. Condor is more at batch level and can use something for workflows; PanDA does this. You could use HTCondor at SLAC.

JB: A BPS HTCondor plugin is something we should have, and PanDA BPS was at the wrong level? TJ: No, PanDA needs to be told what the DAG is etc.; it could drive Condor. Per-site config of PanDA for

10:00 No-meeting holiday period - Eric Bellm - Review experience with December-January no-meeting period

see also https://docs.google.com/document/d/1ET1DS3VUvrgoHn3vxUIO6inT2ONYHYuEWHSZUlh6I3o/edit#heading=h.toi68rogouu0


MB: Loved it. Started early ... and went long; people were ready to be back on the 8th.

KT: Need to decide the number of weeks

FE: Delighted. Agree with MB; 2 weeks more often may be better. 4 weeks is also too long for missing meetings.

YA: AAS meeting free week was exceptional  - mentioned Ops initiative (above)

LG: Indeed AAS was a special case; 2 weeks would be great. CERN gives extra days on top of vacation. Would improve productivity.

FM: Team liked it. The SLAC shutdown is good; it was great to have it across broader DM. 3rd week was great, 4th week was a bit long; 2 weeks is a good minimum.

IS: Some negative feedback, mainly on messaging. Developers who do not have that many meetings were worried it was an enforced quiet period, and were a little unhappy about cancelled meetings, since a meeting is a check-in and a chance to see people. Personally liked it but was also sick. Within the local group we should have kept some meetings, like metrics; missed some things.

JB: Important to cancel meetings not just push them off. Focus on long term projects and skip maintenance for a while

RHL: In favor of this. Piling up at end of year (Noir people take unused vacations). 2 weeks at Christmas and 2 weeks another time would be better.

FE: COVID made the use-it-or-lose-it worse. Disappointed we did not get the messaging correct: it was not Focus Friday and not a meeting ban; people still met. SQuaRE standups: people opted in. People got more of Frossie's time on specific projects/devs.

CS: Was confused. FE: Simple heuristic: if I look at Feb 2023 there are a bunch of meetings; those are the ones we want to kill off.

KTL: Not being able to go away because you need to report etc.: doesn't this mean we have a problem with delegation?

FE: Decision process may be the problem; it forces use of RFC etc. Pushing more to geekbot. Better than other projects but still poor.

CS: decision tree process for meetings or not but what about others.

WOM: To summarise: 2 weeks is the answer, perhaps 2 times a year, plus another two 1-week periods. Communication on that can be improved; Colin/Ian should challenge me more on the wording next time.

10:30 Break

Moderator: Frossie Economou 

11:00

Milestone Parade

Wil O'Mullane 

As we did last time: look over overdue and near-term milestones.

https://docs.google.com/spreadsheets/d/1TUIUf84qHX5QfcCNWs27HGKlgHmCKcCs1IpBP5ODNDA/edit#gid=0


Also the Milestones LCR; I think I have it tidied:

https://docs.google.com/document/d/1LbJBg3thTHkcSMXkdUs1nhrRRKlee1cvPasr0XArghM/edit

DMTN-158 updated as of this morning.  Milestone retirement is still on its old trajectory, though the scheduled "cliff" has fallen away below us.  Need to do better.

DM-AP-10 (SFM2): Ian Sullivan waiting for analysis needed in order to determine whether ready to claim.  Affected by departure of Chris M.  Robert Lupton Not a formality?  Doesn't this link to whether we need full-focal-plane astrometry? => Leave as-is for now.

DM-AP-12 (DIFF3): Ian Sullivan actively working on this one, still expect to complete before end of year, more likely 3-6 months => Revisit at next DMLT.

DM-AP-14 (ALERTDIST2): Eric Bellm Pending decision on whether we can claim external brokers (esp. Antares) as fulfilling this.  Need to understand how to satisfy SRD requirement.  Do we "claim it" or "descope it" in order to retire it?  Zeljko Ivezic Need to flow resolution through the PST.

  • Eric Bellm and Leanne Guy to recommend and pursue a resolution for DM-AP-14 based on Antares.  

DM-AP-15 (INTEGRATION1): Eric Bellm Spencer has put together a full system in the IDF that brokers have actually connected to.  Not at USDF and not connected to an actual AP pipeline.  DM-AP-16 requires USDF.  Frossie Economou advocates claiming this based on the IDF deployment.  LCR the wording to say "at a Data Facility" to cover this.

  • Eric Bellm Look at recasting the DM-AP-15 test plan to run it against the IDF deployment.   

DM-AP-17 (MOPS components): Ian Sullivan will be moved as part of the currently active milestone LCR.

DM-DAX-13 (GEN2 retired): Tim Jenness Should be a summer-2022 event.  Is the definition "it has been removed" or "it can be removed"?  Lauren signed off on science output equivalence between Gen2 and Gen3 - with even more outputs now in Gen3 than in Gen2.  Decision: can be claimed in the February monthly report.  Does not need to be "it has been removed from a release".

DM-DAX-14 (Provenance system review): Frossie Economou This is done.  DM-DAX-9 will cover only Butler provenance.  Where is the rest of it?  Will claim DAX-14 in the January monthly report.

  • Frossie Economou Will recommend additional Level 3 milestones for implementation beyond just the DAX-9 Butler provenance milestone.   

DM-AP-16 (INTEGRATION2): Ian Sullivan Will be moved forward by the current milestone LCR.

DM-DAX-9 (Butler provenance implementation): Add to current milestone LCR and tie to DP2?

  • Tim Jenness Make sure DAX-9 milestone is retitled to make clear it's Butler-only.  

DM-DRP-24 (DRP-MS-IMCHAR-3, PSF): Yusra AlSayyad To be moved by LCR

DM-DRP-29: Yusra AlSayyad  will be done in April

DM-DRP-31: Yusra AlSayyad  in LCR needs training data from LSSTCam

DM-DRP-36: Yusra AlSayyad  make successor of AP

DM-DRP-37: Yusra AlSayyad  not done, no action

DM-NCSA-21 (processing for ComCam): Unknown User (mbutler) see notes in spreadsheet, does not seem to be ready to claim

DM-NCSA-23: Unknown User (mbutler)  DP0.2, will be claimable by June

DM-NCSA-14: Unknown User (mbutler)  Rucio, to be done in next month or two

DM-SQRE-6 (SODA): Gregory Dubois-Felsmann Will be claimable based on functionality delivered for DP0.2.

DM-SQRE-7 (image services): Gregory Dubois-Felsmann no change, DP0.2 is not enough, perhaps late 2022.  Frossie Economou May have to push out further.

LDM-GEN3: Leanne Guy Running tests now, planning to wrap up tests by end of month.  DMTR-271.

LDM-503-12: didn't capture notes, covered by new LCR

LDM-503-15a: didn't capture notes

Upcoming milestones

DM-NCSA-26: LCR and transfer to be a USDF/SLAC responsibility.  Unknown User (mbutler) will check to see if this is in a pending LCR already.

DM-PORTAL: Frossie Economou Ready to claim, has been done

DM-SQRE-4: Frossie Economou Claimable now.

DM-SQRE-5: Frossie Economou Connect to end of the mini-surveys (effectively DP2)?  This is a capability beyond LDM-503-RSPb.

  • Frossie Economou Recommend a change to the schedule / dependencies of DM-SQRE-5.  

LDM-503-12a: Need to associate with LSSTCam-on-sky.  Push out to be around CAMM8090 (Camera pre-ship review at SLAC)?

  • Wil O'Mullane Find an appropriate commissioning/I&T milestone to connect to LDM-503-12a.  Camera acceptance test?  CCOB test on summit?  

LDM-503-13 (ops reh for DRP #1): Wil O'Mullane Intended to claim based on the functionality needed to deliver DP0.2.  Yusra AlSayyad Consistent with review story that we are making up for delays in actual hardware by doing DM testing against the DPs.

LDM-503-14/15/16/17 are all in the current milestone LCR, tied to COMC-0100.  Frossie Economou Worried about the scale of the test plans required for these milestones - do we have time to even write these tests?  Tim Jenness This is effectively the test plan for the DMSR (LSE-61) itself (summed over all these milestones).  Kian-Tat Lim LDM-503-14 should be just the Priority 1 parts of the DMSR.

Leanne Guy Maybe we need a milestone to represent the subset of DM functionality needed for ComCam?  Robert Lupton Base it on actually processing AuxTel data.  Wil O'Mullane That would just be the priority 1a milestones.  LDM-503-14 would stay as is and would cover both priority 1a and 1b milestones.

Further discussion leads to reusing LDM-503-12a as the milestone for completing all the priority 1a requirements, and connect it to ComCam-on-sky in the schedule (currently May 2023).

Decision: -12a milestone will have been claimed at ComCam scale, then we will use the verification-monitoring process to retest at LSSTCam scale.

Frossie Economou notices that there are at least some 1a requirements that are not really properly needed for commissioning (like a full-user-community-scaled archive).  Need a review / LCR for priorities.

  • Leanne Guy Coordinate a round of revision of DMSR requirement priorities.  Input needed from all product owners.  Will need LCR to implement.
  • Frossie Economou and Gregory Dubois-Felsmann Show LDM-503-RSPb as a predecessor of LDM-503-14, and tie it to an appropriate SV milestone in the commissioning plan to shift its date away from -RSPa.  Add to LCR.   
  • Wil O'Mullane Pull together final version of milestone LCR, show at February 28 DMLT if possible.  
12:00 Status Part I

>>> import random

>>> presenters = ['Leanne', 'Cristian', 'Frossie', 'KT', 'Fritz', 'Michelle', 'Ian', 'Yusra']

>>> random.shuffle(presenters)

>>> presenters

['KT', 'Yusra', 'Michelle', 'Ian', 'Cristian', 'Frossie', 'Fritz', 'Leanne']

(Not taking notes on what's on the presented slides, only questions/discussion.)

Arch:

A little additional discussion of the decision to defer "client-server Butler" work.

KTL's slides triggered a vigorous side-discussion of "who owns campaign management?".  Not a construction deliverable.  Clearly a USDF responsibility as an Ops deliverable.

Evolved into a discussion of "what is campaign management?" - there seem to be two things being described under the same name, one of which doesn't yet have a clear written description.

  • Jim Bosch Will write a short tech note on "what Robert wants" re: campaign management to allow concrete review to proceed.  Existing documents in RTN-023 and DMTN-181.  
  • Robert Lupton, Richard Dubois Will either sign off overtly on JB's planned note on campaign management or write a separate note that sets out the needs that he has identified.  

DRP:

Brief discussion of how to cover the coarser levels of the HiPS grid.


12:30

Break

Moderator: @womullan

Notetaker: Kian-Tat Lim 

1:00


Status part II

NCSA/LDF:

MichelleB is worried about moving data to SLAC and other transition issues

Some components from NTS are shared with other equipment; won't be shipped until Sep

Must get USDF running well before mid-August (aiming for June)

WOM: All data should be moved; could put some on tape and require asking for it to come back

AP:

WOM: Will Prompt Processing test measure the end-to-end alert delivery timing? Yes; taking the place of other planned measurements; profiling/optimization will happen in the future

IT:

Cristián is unavailable

SQuaRE:

No longer using "SUI"/"SUIT" terminology; use "Portal" instead

GPDF and FE to work on proposal for resources in the RSP

Burwood is contributing technical effort to cover some of the shortfall due to Simon's absence, but not everything can be covered in a timely manner

Ways of non-interactive notebook execution: mobu for testing, Argo Workflow for "recommended", eventually "notebook batch".  Developing new service to ask nublado to execute a notebook (with parameterization).  Useful for DP delegates who only want to look at the results.  Times Square presents results as a web page (caches to avoid re-execution).  Also usable for end-of-night reports.

DAX:

Qserv user-generated products work depends on other decisions, such as understanding what, if anything, will happen with VOSpace

LG: SST trying to understand definition of done for DAX in Construction; will support DAX verification

GPDF: Is a waiver needed for user-generated products? Yes, build with Ops money

Science:

Special Programs support with non-standard pipelines generally moves into Ops, but need to make sure that RSP can handle SP products

Expect MW Gen3 verification campaign to be done by end of Feb

1:45 Wrap up


Day 3, Thursday Feb 17 2022  (Doesn't look needed as of Feb 7)

Time (Project)TopicCoordinatorPre-meeting notesRunning notes

Moderator:

Notetaker:

09:00













12:00 Close




Proposed Topics

Topic | Requested by | Time required (estimate) | Notes
Status of Prompt Processing | Eric Bellm | 30 | requesting status report/alignment from Kian-Tat Lim
USDF deployment | Eric Bellm | 30 | if there is news to share from Richard Dubois on hardware deployment at the USDF; otherwise defer to a future DMLT ... maybe this is just a readthrough of RTN-021
No-meeting holiday period | Eric Bellm | 30 | review experience with December-January no-meeting period
milestone parade | | 30-50 | as we did last time - look over overdue and near term milestones
Missing functionality | | 30-50 | Chucks list Missing Capabilities needed by Commissioning
QA | | 30 | Quality of QA additions to lsst_distrib e.g. inability to run at IN2P3
big files and little files and little pieces of big files | Jim Bosch | 30 | Pipelines is starting to optimize its own I/O by merging small files into bigger files at various points.  RSP services are starting to read small subsets of our biggest files, and generally assuming that this is faster than reading the entire big file.  Are we fooling ourselves testing these on GPFS instead of object stores?  Do we actually need to be writing more small files, or fewer bigger files?  Does Rucio care?
Image services (esp. for DP0.2) | | 30-50 | Define the remaining missing pieces needed to deliver acceptable image services for DP0.2 - and beyond - and ensure that they are all assigned to specific groups/T-CAMs.
Build system | | 20 | Do we want any changes as we reimplement the build system at SLAC?  How can we improve reliability and maintainability?  What outputs are needed that we don't currently have (e.g. images from CI runs)?  Can we simplify and combine any existing processes (e.g. separate lsst_distrib, ci_hsc, ci_imsim, nightly-release runs)?
PanDA evaluation/selection | Jim Bosch | 20 | I've heard slightly conflicting reports about whether PanDA is definitely what we'll use for batch processing vs. something we're still evaluating (and if the latter, who is deciding, and when).  I have some guesses about what might make us prefer it over HTCondor (or Parsl?), but they really are just guesses.  I'd like to understand what the roadmap is here, both for long term use in operations and what our developers will be using after the move from NCSA to SLAC.







Action Item Summary

Description | Due date | Assignee | Task appears on
Frossie Economou Will recommend additional Level 3 milestones for implementation beyond just the DAX-9 Butler provenance milestone. | 15 Mar 2022 | Frossie Economou | DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17
Kian-Tat Lim Convene a meeting with Colin, Tim, Robert, Yusra to resolve graph generation with per-dataset quantities (likely based on Consolidated DB work). | 18 Mar 2022 | Kian-Tat Lim | DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17
Frossie Economou Write an initial draft in the Dev Guide for what "best effort" support means | 17 Nov 2023 | Frossie Economou | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24
Convene a group to redo the T-12 month DRP diagram and define scope expectations Yusra AlSayyad | 30 Nov 2023 | Yusra AlSayyad | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24
 | 11 Dec 2023 | Gregory Dubois-Felsmann | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24
 | 02 May 2024 | Frossie Economou | DMLT Meeting - 2024-04-22
 | 22 May 2024 | | DMLT Meeting - 2024-04-22
Richard Dubois USDF part in data facilities for PSTN-017 and distrib processing ? | 22 May 2024 | Richard Dubois | DMLT Meeting - 2024-04-22
 | 22 May 2024 | Fabio Hernandez | DMLT Meeting - 2024-04-22
Tim Jenness - section on middleware for PSTN-017 | 22 May 2024 | Tim Jenness | DMLT Meeting - 2024-04-22
Cristián Silva - section on summit/data acquisition for PSTN-017 | 22 May 2024 | Cristián Silva | DMLT Meeting - 2024-04-22
Cristián Silva if you come to SLAC lets have plenty of photos of locks and racks etc .. | | Cristián Silva | DMLT Meeting - 2024-04-29
 | | Richard Dubois | DMLT Meeting - 2024-04-29