Logistics

Date



Location

This meeting will be virtual, via Zoom (connection details below).

Join Zoom Meeting

https://gemini.zoom.us/j/92414793405?pwd=QU9ueHdrNzJ3ZVFMbDB0Uk1sU09ndz09

Meeting ID: 924 1479 3405
Password: 210622

Attendees

Wil O'Mullane

Frossie Economou

Gregory Dubois-Felsmann

Colin Slater

Cristián Silva

Eric Bellm

Fritz Mueller

Ian Sullivan

Jim Bosch

Kian-Tat Lim

Leanne Guy

Unknown User (mbutler)

Unknown User (npease)

Richard Dubois

Robert Gruendl

Robert Lupton

Simon Krughoff

Tim Jenness

Yusra AlSayyad

Regrets

All Times PT.

Day 1, Tuesday June 22

Time (Project) | Topic | Coordinator | Pre-meeting notes | Running notes

Moderator: Kian-Tat Lim

Notetaker: Ian Sullivan

09:00 | Welcome | Wil O'Mullane
  • Introductory remarks
  • Review agenda and code of conduct

09:15 | Project news and updates | Wil O'Mullane
  • RHL: There is a possibility of travel down to Chile as well as to the office.
  • GPDF: At Caltech starting next week we can go back with almost no restrictions, including having visitors and using meeting rooms. Mandatory return to office will be September.
  • AURA is not asking about vaccination status; Princeton, SLAC, and UW will require vaccinations. Caltech requires reporting vaccination status, and vaccination is likely to be required upon FDA approval.
  • Camera official delivery date is August 19. RHL: it will not actually be done at that point; they will still be tinkering with the voltages. WOM: we can't realize or burn down the camera-related risks until they're done modifying it.
09:30 | Community broker selection | We are still resolving a few last issues in the SAC's preliminary broker selection report; no DMLT decision-making is needed right now.
  • KTL: Does the hybrid alert model make a difference in how many brokers we can support?
  • EB: We asked the broker teams if they wanted a subset of alerts, and they all requested the full stream with full alert packets.
  • TJ: This means that every broker wants the full postage stamp as well?
    • EB: Yes, though many have said they would be OK with a service to look up the images.
  • LPG: Steve K wants to answer this as an Operations question. We're not going to commission 6 and then run 5 in Operations.
  • KTL: It's also only a problem if everyone wants it at the same latency.
    • WOM: The six all do want all the data, within the 60s window.
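For a rough sense of scale on serving the full stream to six brokers within the 60 s window, here is a back-of-envelope sketch; all input numbers are illustrative assumptions, not project figures:

```python
# Back-of-envelope bandwidth for full-stream delivery to all brokers.
# All inputs are illustrative assumptions, not official Rubin numbers.
alerts_per_visit = 10_000        # assumed mean alert count per visit
alert_size_bytes = 100 * 1024    # assumed packet size incl. postage stamps
n_brokers = 6
window_s = 60                    # latency window discussed above

total_bytes = alerts_per_visit * alert_size_bytes * n_brokers
required_gbps = total_bytes * 8 / window_s / 1e9
print(f"{required_gbps:.2f} Gb/s sustained")  # prints "0.82 Gb/s sustained"
```

Under these assumed numbers the aggregate rate is under 1 Gb/s, which suggests the constraint is less raw bandwidth than guaranteeing simultaneous low-latency delivery to every broker.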
10:00 | Conda version pins | Jim Bosch

Doing less pinning in our conda envs lets users install their own things on top, at the expense of reproducibility. Could we start providing both pinned and unpinned versions of each conda env release? I think it's time to admit that we cannot satisfy all consumers with minimal pins, maximal pins, or even a carefully chosen balance, but I'm hoping we can simultaneously support two envs, each aimed at a different set of consumers.

KTL: We already have this. The only thing that's lacking is an easy way to create a newinstall environment with the fully-pinned versions. My version of Gabriele's lsstinstall script (currently on a branch of lsst/lsst) intends to provide this. Also note that stack (not RSP) containers are effectively pinned unless someone installs something on top.

(#1: reproducible, #2: extensible)

  • JB: If we require a reproducible stack, we may need to include a few additional packages in order to support users.
  • KTL: Cleaning up Gabriele's lsstinstall script. Working on it this morning, since Mario might be able to make use of it now.
  • KTL: Prior to conda 4.9, anything you installed required that the versions of any new packages exactly match the versions of all existing dependencies. Newer versions allow you to install additional packages and update versions, but then you have lost complete reproducibility.
  • KTL: The shared stack is a different problem: we can install additional packages that make developers' lives easier, but we also want a minimal development environment without any additional packages.
    • We can possibly do this for the shared stack, but not for the binary installs.
  • RHL: Why is this a problem, if I just install new things it shouldn't require changing the build of the stack?
    • FE: If a user pulls in a new package, it frequently includes updated dependencies that are already in the stack.
    • TJ: We have loads of flexibility. The problem is that we depend on lots of Python packages, and if a new package needs a newer version of one of them, that might break us.
    • KTL: Two ways to go about it: freeze all dependencies, or allow it to float.
    • KTL: There are ways to add packages to the lab containers or the shared stack, as long as they don't lock in incompatible versions.
    • TJ: Are you thinking of doing Rubin-extra in addition to Rubin-env?
      • KTL: Yes.
  • RHL: I would like to move away from the expectation that we tell people they must install their own packages in the RSP
    • FE: There is a process for people to add stuff to containers.
    • FE: In Operations, the RSP is a very slow moving environment tied to official releases.
    • RHL: I don't understand why we aren't more user friendly with our containers.
      • FE: We have to determine whether many users need new packages, or if it is just us/RHL. This is what the Data Previews are for, to determine what real users in the wild will need.
    • FE: The emerging model is that there are varying classes of deployments. For the Telescope environment, we might make the trade-off that we allow users to change the underlying configuration which might break it for everyone, in exchange for rapid development. For the science users, we need an absolutely stable environment.
    • KTL: It is possible to give users more of a choice, but it means we have more complicated builds.
    • RHL: It is great that the Telescope team may have a flexible environment, but I worry that will grow to include the entire commissioning team.
    • FE: My preferred model for Operations is that we have a separate enclave on the Data Facility for developers and one for the thousands of science users.
  • KTL: We need to have our standard Rubin-env for stable releases, and Rubin-env-extra for the additional packages.
    • WOM: We might need different Rubin-env-extra environments in different places.
    • KTL: That should be OK, we can have multiple sets.
    • JB: I'm willing to live with flexible notebooks that don't guarantee reproducibility, as long as I can always get a minimal build that does guarantee reproducibility.
    • FE: The problem is that some packages have dependencies in common with the stack, though it's rare. An additional problem is that we don't have a build engineer, so we don't have a dedicated person to solve this.
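The pinned-vs-unpinned idea above can be sketched as two conda environment files; the file names and version numbers here are hypothetical, not the actual rubin-env recipes:

```yaml
# rubin-env-pinned.yaml (hypothetical): reproducible, every version exact.
name: rubin-env-pinned
channels:
  - conda-forge
dependencies:
  - python=3.8.6
  - numpy=1.20.3
  - astropy=4.2.1
  # ...every transitive dependency pinned to an exact version

# rubin-env-loose.yaml (hypothetical): extensible, lower bounds only, so
# users can `conda install` extra packages at the cost of reproducibility.
name: rubin-env-loose
channels:
  - conda-forge
dependencies:
  - python>=3.8
  - numpy>=1.20
  - astropy>=4.2
```

A conda environment file holds a single document, so these would be two separate files; the comments mark the split. The pinned file answers the "#1: reproducible" consumer, the loose one the "#2: extensible" consumer.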

Pop-up topic: Are people happy with Gen 3?
  • LPG: I am very happy, and hear from a lot of scientists that they are as well.
  • SK: Very happy overall, but we must fix the error message when we get an empty quantum graph. It is hard to step through all the datasets to find what is missing, and the missing input is often in a late stage, so many tasks could have run successfully before it is hit. KSK: I'm not sure it's possible to do with logging; it may actually require more tooling.
  • RL: I am very happy Gen 3 is coming out, worried about how flexible it really is and whether we have tested all of it. There is no question that it is better than Gen 2.
  • RG: My concern is what will happen when you try to share amongst a lot of people.
    • TJ: You're worried about registry overload? RG: Yes
  • RD: From the USDF perspective, the Gen 3 butler raises data processing and data handling questions
    • TJ: I hope the execution butler will solve all these problems.
    • RD: Also worried about multi-site registration for the products
  • YA: Writing a fresh pipeline task is easy. Our struggles have been getting the same tasks to run the same way in both Gen 2 and Gen 3
    • YA: Hear a lot of complaints from the camera team, but not clear they're actionable. 


10:30 | Break

Moderator: Wil O'Mullane

Notetaker: Simon Krughoff

11:00 | DMTN-185 Provenance | Walk through the recommendations of the Provenance WG and identify which T/CAM(s) own which so they can accept or reject them
  • REC-EXP-2:
    • Tim: We have a way of associating images together, GROUPID. Things would get better if we had an "M out of N" header, because we don't know when to run define visits: we don't know when all the data have shown up.
    • RHL: This is really campaign management
    • Tim: Snaps can't be part of campaign management
    • RHL: It's part of it
    • Jim: This seems like perfect enemy of the good territory
    • Frossie: will create an extra meeting to hash this out
  • REC-EXP-3: Frossie will shepherd, but there is obviously a lot about observatory management that has slipped through the cracks.  Will need to bring together multiple sub-systems to hash things out
  • REQ-TEL-001: All data is exported, but could be exported to Kafka
  • REQ-TEL-003:
    • KTL: This is under consideration and is working through the chain
    • Frossie: Does this prevent CSCs from hard-coding firmware versions?
    • KTL: Will have to make sure that's part of the wording
  • REC-SW-2: Patrick, Tiago, Andy, and K-T should meet to hash out whether commanding configuration is in the plan
  • REQ-PTK-003:
    • Frossie: This seems a little scary
    • Jim: I don't think it's that bad except for setting up the right software
    • Tim: This is specifically running a part of the graph
    • Jim: We could provide some tooling to help do this
    • Tim: We have a requirement to do this because of the virtual data products
  • REQ-PTK-005:
    • Jim: If you replace URI with UUID, I think this is solved

DMTN-185 Post facto 2021-10-09

  • REQ-WFL-001: Done by Tim.  Butler datasets.
  • REQ-WFL-002: Ops campaign management project. BPS configuration and logs will be made available by Michelle Butler.  Any other workflow level (docker container version) information will be handled by the campaign management team.
  • REQ-WFL-003:
    • Tim: Campaign management need this
    • Jim: This is part of middleware
    • Tim: segv will not show up
    • Jim: Failed quanta and failed jobs are different.  Former from middleware, latter from BPS logs.
    • Frossie: Do we have the tooling to surface this information?
    • Tim: Yes; PanDA knows about job failures
    • Frossie: Tim owns making sure this information is surface-able
  • REQ-WFL-004: Panda pilot can surface CPU, memory, I/O info
  • REQ-WFL-005: Tim will make sure OS info is in base_packages (sp?).  This should include host node info to the level possible.  This may be via nodeId that means something unique to somebody
  • Frossie to add requirement for node ID inventory at the data centers
  • REC-FIL-001:
    • Gregory: The unique thing is the UUID
    • Tim: But this is not going into the header.  It means all formatters need to know how to write metadata and all readers will need to know that there is (could be) a UUID that should be used.
    • Frossie: If I ship a user a dataset, they have to be able to tell me back what dataset I shipped them.  Whether that is through UUIDs or some other mechanism, there needs to be a way
    • Tim: not all datasets know about metadata
    • Frossie: assuming all science datasets will have metadata is reasonable
  • REC-FIL-002:
    • Gregory will do the study in an ops capacity
  • REC-FIL-003:
    • Tim: This isn't a file level thing 
    • Frossie: Propose to strike this, on the grounds that it is an already-understood objective
    • Robert G.: We can strike it, but this is more about tooling later
    • Frossie will move this req to another place
  • REC-SRC-001:
    • K-T will do the census of flags to make sure we can fit in 64 bits for sources and 128 bits for objects with buffer
  • REC-SRC-002:
    • K-T will look into data release ids fitting in 4 bits
  • REC-SRC-003:
    • With the above two, K-T will look in general at whether 64 bits is sufficient for source IDs
  • REC-SRC-004:
    • Leanne will provide new language in the DPDD around footprints and heavy footprints and Gregory will collaborate
  • REC-MET-001:
    • Frossie will replace dataId with UUID and claim it
  • REC-MET-002 – Done
  • REC-MET-003:
    • Yusra will drive adding sufficient metadata to persisted Job objects that specific measurements can be looked up from the original butler repository from metadata in the Job. I.e. the repo root, run, collection, and dataId will all need to be knowable from the JSON persisted Job object.
  • REC-MET-004:
    • Yusra will describe how this is done currently with measurements not related to specific datasets like runtimes in jointcal and verify_ap
  • REC-MET-005:
    • Tim: There is no problem with having a special metric measurements backend to butler
    • Frossie will discuss with Yusra whether/how this will be pursued
  • REC-LOG-1:
    • Richard owns logging.  Frossie will coordinate
  • REC-LOG-2:
    • Frossie will make sure log management solutions are in place for all sites
  • REC-LOG-3:
    • Frossie will raise to DPLT
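The REC-SRC bit-budget checks above (flags in 64/128 bits, DR ids in 4 bits, 64-bit source IDs) amount to simple arithmetic. A sketch with hypothetical field widths, not the adopted ID layout:

```python
# Does a candidate 64-bit source ID layout fit? Field widths below are
# hypothetical, for illustration of the REC-SRC-001..003 checks only.
FIELDS = {
    "data_release": 4,     # REC-SRC-002: up to 2**4 = 16 data releases
    "detector": 9,         # e.g. 189 science CCDs fit in 8 bits; 9 adds margin
    "visit": 32,           # generous budget for an exposure/visit counter
    "source_counter": 19,  # per-detector-visit running source number
}

total_bits = sum(FIELDS.values())
assert total_bits <= 64, f"layout needs {total_bits} bits, over budget"
print(total_bits, "bits used,", 64 - total_bits, "to spare")
```

The actual census of flag bits and the chosen field widths are exactly what the K-T action items above are meant to determine.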
11:45 | User Batch | Impersonation or not? Inside K8s or outside? Integrated with DF systems or not? Could UWS be enough? Are we even ready to start discussing requirements or design? If not now, when?

See Level 3 Definition and Traceability for the collected relevant requirements (summary: they don't constrain this very well).

  • Frossie: Much of this is a lot of work. Would it be the worst thing in the world to offer batch that requires running exactly like production (e.g. using pipeline tasks)?
  • Tim: If we put user auth in Panda, this is basically trivial.  If we offer running arbitrary docker images, this gets way harder
  • KTL: Of course the standard HPC env is a shell prompt, not BPS
  • GPDF: I thought we would go just that route, e.g. batch submission from the command line.  It's late to do something more sophisticated unless we bring in someone else's system
  • Frossie: CADC's model is different from ours, so we can't borrow from them
  • Richard: We are adding cores throughout the project.  My suspicion is that most people won't do image processing, but will be doing random batch processing with results of queries
  • Eric: There is a steep learning curve with our pipelines code if we make them go that route
  • RHL: Colin's use case is the one I really want supported
  • GPDF: The community compute is meant to democratize access, not support large collaborations like DESC completely
  • Wil: I believe we have provided this via notebooks.  People do want dask or spark, but we need a solution that is controllable
  • Frossie: We have always talked about there being a TAC that will manage access
  • Leanne: In ops this is called the User Committee
  • Wil: We may get away without having to have a lot of process around allocation depending on usage patterns
  • Frossie: It is probably best to be legalistic about requirements so that we don't get caught in the situation where we are providing "nice to haves" at the expense of delivering the system we promised
  •  Leanne Guy will provide a reference-able document on interpretation of the user batch requirements that will define the minimum viable system we need to deliver. (Update: requirements will be presented at 2021-10-18 vF2F meeting) 
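Frossie's option of offering only "batch that runs exactly like production" would amount to users submitting ordinary BPS payloads against their own output collections. A minimal sketch of such a submission config follows; the pipeline path, repo, and collection names are hypothetical, and the exact keys should be checked against the ctrl_bps documentation:

```yaml
# user-batch.yaml (hypothetical): a user job reusing the production machinery.
pipelineYaml: "${DRP_PIPE_DIR}/pipelines/example-drp.yaml"  # assumed path
payload:
  payloadName: user_reprocessing
  butlerConfig: /repo/main              # assumed shared repo
  inCollection: HSC/defaults            # assumed input collection
  output: u/someuser/reproc             # user-owned output collection
  dataQuery: "tract = 9813 AND skymap = 'hsc_rings_v1'"  # assumed subset
```

Submission would then be e.g. `bps submit user-batch.yaml`; the impersonation, quota, and allocation questions from the discussion above live in whatever service wraps this.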
12:30 | Break
Moderator: Leanne Guy

Notetaker: Wil O'Mullane
13:00 | Prompt Processing | Kian-Tat Lim | Use OCPS or start building a more sophisticated execution system for USDF?
  • no detailed design for prompt processing - could use OCPS if we added an event trigger
  • RobertG worried about security (OGAs) not allowing this everywhere - baseline is USDF with secure links.
  • Worry about faro publication to SQuaSH being slow - Frossie and Leanne agree this is a bug, probably in the SQuaSH API, and it will be solved.
  • LPG: faro writes out single scalar quantities, should not be any issues with storage. 
  • GPDF reminds us that *originally* PP was going to be at the Base AND at the Archive/USDF.
  • RHL in favor of using OCPS for prompt - need access to SAL messages
  • Colin worries that OCPS does not cover all the open issues - but OCPS exists and could be a step in the right direction
  • Eric - if we moved to Chile, does the Cassandra prompt DB also need to move? Yes.
  • Tim - wherever it runs, you need to reflect this in the OODS; other problems like graph generation will have to be solved. But it's not in the planning.
  • Jim - the Gen3 problems are not hard if you don't use quantum graph generation, but a bit of time needs to be scheduled
  • RHL - if we generalize prompt processing a little, it will solve lots of problems currently in OCPS
  • Cristián - how much space at the summit? About one rack.
  • Frossie worries about running in Chile - the Ops IT situation is unclear, among many other problems. Do the minimum on the summit and throw it away; separate alerts from the OCPS use case. RHL - to say we are only doing sanity checks is not correct; we need multi-step scatter-gather.
  • Richard - sounds like a workflow engine; is PanDA an option to run prompt processing? UWS can be interfaced to anything, including PanDA.
  • Mostly between Tim and K-T - WOM wants to stay involved, to make sure PP does not get over-complicated.
  • Colin - how does he gain confidence that he will have prompt processing?
  • Tim - once DP0.2 is done PP is the priority.
  • KTL will develop a design document: DM-30854.
13:30 | Exposure table

The exposure table is a key piece of observatory metadata, but I have been unable to determine who is in charge of constructing it, and its lack is starting to block work. Gregory Dubois-Felsmann or I can give a brief overview of the state of play, but we should identify a way forward.

YA: also, what, if any, is the relationship with the pipeline-output CcdVisit and Visit tables?

  • Yusra, from Sci Pipes - parquet for visits is implemented (covers exposure); some things from the EFD, like mirror positions, are unclear, as is how to tie them in.
  • Tim - concern that visit and exposure are not the same; GPDF used both words separately, with different meanings. Most exposure info can come from the EFD (GPDF: plausible). What is the path from the EFD to a new (FITS) header, with each keyword in a table? GPDF says that is there. Need to get it back into Gen3 - naming needs to be fixed and homogenized. The Gen3 formatter needs to get the header from this system (per DR).
  • KT - lots of metadata are calculated at different times, up to a year later, so is it one thing or multiple things? GPDF - we need a technical architecture for this; it may need separate tables.
  • FE does not want to be pulled into this - there is aggregating in-stream formatting in Kafka, with a demo of this for weather data into a relational table. This should fulfil the needs above, but does not solve the data model question. CloudSQL on the IDF for Postgres.
  • Tim - will the butler registry at the USDF be kept up to date at low latency?
  • We can release pointings, but not pixels, faster than 24 hrs.
  • Yusra - how many tables? Should think of it partially as a data product output of the pipelines.
  • Richard - plots per exposure, or plots per multiple exposures? Tim - put them in the butler Gen3 repo.
  • KT - the other place is the LFA - but that holds other data sets we would otherwise have had in the butler.
  • Who is going to make this happen?
  • RHL would like to see it designed.
  • GPDF - there is substance in his two points; baseline those. (General agreement / no objections.)
  • Who takes responsibility for moving this onward?
14:30 (latest) | Close

Day 2, Wednesday June 23

Moderator: Wil O'Mullane

 Notetaker: Colin Slater
09:00 | AHM at PCW

What should we cover?

  • Rebaseline
  • Ops transition ..
  • Hands-on Gen3? (IDF or NCSA?)
  • Tim: Session 1) Gen3 Q&A for developers to ask question of middleware. Session 2) Helping the community switch from Gen2 to Gen3.
  • Simon: Most users either know Gen3 from start, or have already started 2→3 transition.
  • Ian: Good for some of the Gen3 power users (non middleware devs) to lead something from a user perspective. Wil: So a tutorial? Simon: hard to know what issues people are going to have, if we have a tutorial we should also have a Q&A.
  • Wil: Q&A for DM developers. Then Tutorial session. Then slots for "come in and ask question", open to anyone, "this is what I'm trying to do".
  • Yusra: Not great attendance with help/tutorial sessions at prior PCWs.
  • Jim: How many people who would be helped by this are actually planning to attend PCW. Wil: DP0.1 users coming online, some fraction of that might want this? KT: PCW planning on community, can use that to gauge.
    •  Ian Sullivan Discuss within Science Pipelines who should lead a Butler tutorial/QA session at PCW.
  • Tim: CET might already have good tutorials for Gen3.
  • KT: Review of how DM works w.r.t SIT-COM, urgent tickets.
    • Frossie Economou Prepare a "How DM works with SITCOM et al." presentation as part of the PCW DM All Hands session.
  • KT: PCW in Chile? Wil to discuss with Victor. Wil: Add slide to deck
  • RobertG: Docs on Gen3 are required for deprecation.
  • Gregory: DP0 "how it's going" session is canceled? Yes, we have many sessions with Delegates. Frossie will have a "Coffee with RSP Devs" session. Separate session for RSP Devs w/ other data centers.
  • Tim: concerned about duplication of effort between CET gen3 docs and DM gen3 docs. Tim and Leanne will resolve offline. Gregory similar concern. Wil, when does this link back up? After we get feedback from delegates, we'll know more about what was useful. Simon is working on updating the pipelines.lsst.io tutorial to gen3. Yusra: Task docs also exist, need refresh in the fall.
  • Frossie: Russ has a good tech talk on security, arrange with Cristian. Q&A on security.
09:45 | Status I

Team status and brief overview of epics to FY23 given to Kevin.

  • 10 minutes each - link slides in agenda below
  • Ordered by coordinates, north to south:
    • UW 47.65, -122.30
    • Princeton 40.34, -74.68
    • Urbana 40.11, -88.19
    • SF (Arch) 37.76, -122.43
    • Palo Alto 37.43, -122.15
    • Tucson 32.20, -110.96
    • Chile -29.91, -71.24

Prompt Processing

Data Release Production

LDF

Arch

  • Alert Production
    • RHL: is the plan to have AP prototype processing running at SLAC by next summer? Yes. Eric: Hope that AP effort serves as a forcing function. Fritz: Need to have compute on the floor. There are ways to find compute.
  • Data Release Production
  • NCSA
    • Is NTS going to Chile or Tucson? ITTN-30 gives the test stand plan. (CTS: Couldn't understand the answer on this, someone else should supply)
  • Arch
    • Gregory: Status of RFC-775? Jim hasn't gotten to writing the implementation tickets, will then adopt.
10:30 | Break

Moderator: Wil O'Mullane

Notetaker: Kian-Tat Lim

11:00 | Status II

DAX

DM Science Plans

SQuaRE Update

Chile IT and Networking

DAX:

  • Consider moving the schema browser to schema.lsst.io (using LSST-the-Docs infrastructure). Fritz Mueller. Ticketed at DM-25399.

Science:

SQuaRE:

IT: IT Update

12:00 | Wrap up - review actions
  • Schedule a follow-up meeting on the rest of the Provenance recommendations Frossie Economou 
12:30 | Close (latest)



Proposed Topics

Topic | Requested by | Time required (estimate) | Notes
DM AHM | 30min? | Discuss DM AHM at PCW in August.
DMTN-185 | 30-45min | Suggest we walk through recommendations of the Provenance WG and identify which T/CAM(s) owns which so they can accept or reject them
Exposure table | 45-60min

The exposure table is a key piece of observatory metadata but I have been unable to determine who is in charge of constructing it, and its lack is starting to block work. Gregory Dubois-Felsmannor I can give a brief overview of the state of play but we should identify way forward. 

Prompt Processing | 30-45min? | Use OCPS or start building a more sophisticated execution system for USDF?
User Batch | 30-45min? | Impersonation or not? Inside K8s or outside? Integrated with DF systems or not? Could UWS be enough? Are we even ready to start discussing requirements or design? If not now, when?
Conda version pins | 30min

Doing less pinning in our conda envs lets users install their own things on top at the expense of reproducibility.  Could we start providing both pinned and unpinned versions of each conda env release?  I think it's time to admit that we cannot satisfy all consumers with either minimal pins or maximal pins or even a carefully chosen balance, but I'm hoping we can simultaneously support two envs that each try to satisfy different consumers just as easily.

KTL: We already have this.  The only thing that's lacking is an easy way to create a newinstall environment with the fully-pinned versions.  My version of Gabriele's lsstinstall script (currently on a branch of lsst/lsst) intends to provide this.  Also note that stack (not RSP) containers are effectively pinned unless someone installs something on top.

Community broker selection | 15 minutes | Report back on community broker selection and discuss next steps.


Attached Documents


Action Item Summary

Description | Due date | Assignee | Task appears on
  • Frossie Economou: Will recommend additional Level 3 milestones for implementation beyond just the DAX-9 Butler provenance milestone.
15 Mar 2022 | Frossie Economou | DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17
  • Kian-Tat Lim: Convene a meeting with Colin, Tim, Robert, and Yusra to resolve graph generation with per-dataset quantities (likely based on Consolidated DB work).
18 Mar 2022 | Kian-Tat Lim | DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17
  • Frossie Economou: Write an initial draft in the Dev Guide of what "best effort" support means.
17 Nov 2023 | Frossie Economou | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24
  • Yusra AlSayyad: Convene a group to redo the T-12 month DRP diagram and define scope expectations.
30 Nov 2023 | Yusra AlSayyad | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24
11 Dec 2023 | Gregory Dubois-Felsmann | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24