RHL: There is a possibility of travel down to Chile as well as to the office.
GPDF: At Caltech starting next week we can go back with almost no restrictions, including having visitors and using meeting rooms. Mandatory return to office will be September.
AURA is not asking about vaccination status; Princeton, SLAC, and UW will require vaccinations. Caltech is requiring reporting of vaccination status; vaccination likely required upon FDA approval.
Camera official delivery date is August 19. RHL: not actually done at that point, they will still be tinkering with the voltages. WOM: we can't realize or burn down the camera-related risks until they're done modifying it.
Doing less pinning in our conda envs lets users install their own things on top at the expense of reproducibility. Could we start providing both pinned and unpinned versions of each conda env release? I think it's time to admit that we cannot satisfy all consumers with either minimal pins or maximal pins or even a carefully chosen balance, but I'm hoping we can simultaneously support two envs that each try to satisfy different consumers just as easily.
KTL: We already have this. The only thing that's lacking is an easy way to create a newinstall environment with the fully-pinned versions. My version of Gabriele's lsstinstall script (currently on a branch of lsst/lsst) intends to provide this. Also note that stack (not RSP) containers are effectively pinned unless someone installs something on top.
(#1: reproducible, #2: extensible)
JB: If we require a reproducible stack, we may need to include a few additional packages in order to support users
KTL: Cleaning up Gabriele's lsstinstall script. Working on it this morning, since Mario might be able to make use of it now.
KTL: Prior to conda 4.9, installing anything required that the versions of any new packages exactly match the versions of all existing dependencies. Newer versions allow you to install additional packages and update versions, but then you have lost complete reproducibility.
KTL: The shared stack is a different problem: we can install additional packages that make developers' lives easier, but we also want a minimal development environment without any additional packages.
We can possibly do this for the shared stack, but not for the binary installs.
RHL: Why is this a problem, if I just install new things it shouldn't require changing the build of the stack?
FE: If a user pulls in a new package, it frequently includes updated dependencies that are already in the stack.
TJ: We have loads of flexibility. The problem is that we depend on lots of Python packages, and if a new package needs a newer version of one of them, that might break us.
KTL: Two ways to go about it: freeze all dependencies, or allow it to float.
KTL: There are ways to add packages to the lab containers or the shared stack, as long as they don't lock in incompatible versions.
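KTL's two options above — freeze all dependencies, or let them float — correspond to two flavors of conda spec. A hypothetical sketch of what publishing both per release could look like (package versions are invented for illustration, not an actual rubin-env release):

```yaml
# rubin-env-pinned: bit-for-bit reproducible; nothing may float.
name: rubin-env-pinned
dependencies:
  - python=3.8.6
  - numpy=1.19.2
  - astropy=4.1
---
# rubin-env-floating: minimal pins; users can conda-install on top,
# at the cost of reproducibility.
name: rubin-env-floating
dependencies:
  - python>=3.8
  - numpy>=1.19
  - astropy
```

Publishing both files for each release would let a user choose reproducibility or extensibility at install time.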
TJ: Are you thinking of doing Rubin-extra in addition to Rubin-env?
KTL: Yes.
RHL: I would like to move away from the expectation that we tell people they must install their own packages in the RSP
FE: There is a process for people to add stuff to containers.
FE: In Operations, the RSP is a very slow moving environment tied to official releases.
RHL: I don't understand why we aren't more user friendly with our containers.
FE: We have to determine whether many users need new packages, or if it is just us/RHL. This is what the Data Previews are for, to determine what real users in the wild will need.
FE: The emerging model is that there are varying classes of deployments. For the Telescope environment, we might make the trade-off that we allow users to change the underlying configuration which might break it for everyone, in exchange for rapid development. For the science users, we need an absolutely stable environment.
KTL: It is possible to give users more of a choice, but it means we have more complicated builds.
RHL: It is great that the Telescope team may have a flexible environment, but I worry that will grow to include the entire commissioning team.
FE: My preferred model for Operations is that we have a separate enclave on the Data Facility for developers and one for the thousands of science users.
KTL: We need to have our standard Rubin-env for stable releases, and Rubin-env-extra for the additional packages.
WOM: We might need different Rubin-env-extra environments in different places.
KTL: That should be OK, we can have multiple sets.
JB: I'm willing to live with flexible notebooks that don't guarantee reproducibility, as long as I can always get a minimal build that does guarantee reproducibility.
FE: The problem is that some packages have dependencies in common with the stack, though it's rare. An additional problem is that we don't have a build engineer, so we don't have a dedicated person to solve this.
LPG: I am very happy, and hear from a lot of scientists that they are too.
SK: Very happy overall, but we must solve the error message when we get an empty quantum graph. It is hard to step through all the datasets to find what is missing, and the missing piece is often in a late stage, so many tasks could have run successfully before it is hit. KSK: I'm not sure it's possible to do with logging. It may actually require more tooling.
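KSK's point that this likely needs tooling rather than logging can be illustrated with a toy dependency check; the task and dataset-type names below are invented, and real tooling would query the butler registry rather than a hard-coded table:

```python
# Toy sketch: given each task's required input dataset types and the
# datasets that actually exist, report every task whose inputs are
# incomplete. Task/dataset names are hypothetical, not real pipeline
# contents; real tooling would ask the butler registry instead.

PIPELINE = [  # (task name, required input dataset types), in run order
    ("isr", {"raw", "bias", "flat"}),
    ("characterizeImage", {"postISRCCD"}),
    ("calibrate", {"icSrc", "icExp"}),
]

def find_missing(existing):
    """Return (task, sorted missing datasets) for every under-fed task."""
    problems = []
    for task, needs in PIPELINE:
        missing = needs - existing
        if missing:
            problems.append((task, sorted(missing)))
    return problems

print(find_missing({"raw", "flat", "postISRCCD", "icSrc", "icExp"}))
```

A report like this, emitted before (or instead of) an empty quantum graph, would show the user directly which input broke which stage.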
RL: I am very happy Gen 3 is coming out, worried about how flexible it really is and whether we have tested all of it. There is no question that it is better than Gen 2.
RG: My concern is what will happen when you try to share amongst a lot of people.
TJ: You're worried about registry overload? RG: Yes
RD: From the USDF, the Gen 3 butler brings up questions of data processing and data handling issues
TJ: I hope the execution butler will solve all these problems.
RD: Also worried about multi-site registration for the products
YA: Writing a fresh pipeline task is easy. Our struggles have been getting the same tasks to run the same way in both Gen 2 and Gen 3
YA: Hear a lot of complaints from the camera team, but not clear they're actionable.
Review the recommendations of the Provenance WG and identify which T/CAM(s) own which, so they can accept or reject them
REC-EXP-2:
Tim: We have a way of associating images together, GROUPID. Things would get better if we had an "M out of N" header: we don't know when to run define-visits because we don't know when all the data has shown up.
RHL: This is really campaign management
Tim: Snaps can't be part of campaign management
RHL: It's part of it
Jim: This seems like perfect enemy of the good territory
Frossie: will create an extra meeting to hash this out
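Tim's "M out of N" header idea can be sketched as follows; the header keys (GROUPID, SNAPIDX, NSNAPS) are invented for illustration, not actual FITS keywords:

```python
# Toy sketch of the "M of N" header idea: each exposure declares its
# group and its position in the group, so define-visits can be triggered
# as soon as the set is complete. Header keys are invented.

def group_complete(headers):
    """True once every member of a snap group has arrived.

    Assumes the headers have already been grouped by GROUPID upstream.
    """
    if not headers:
        return False
    expected = headers[0]["NSNAPS"]          # the N in "M of N"
    seen = {h["SNAPIDX"] for h in headers}   # the Ms received so far
    return seen == set(range(1, expected + 1))

group = [{"GROUPID": "G42", "SNAPIDX": 1, "NSNAPS": 2}]
print(group_complete(group))   # only 1 of 2 snaps has arrived
group.append({"GROUPID": "G42", "SNAPIDX": 2, "NSNAPS": 2})
print(group_complete(group))   # the pair is now complete
```

With such headers, define-visits could fire the moment the last snap of a group lands, instead of guessing when the data has all shown up.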
REC-EXP-3: Frossie will shepherd, but there is obviously a lot about observatory management that has slipped through the cracks. Will need to bring together multiple sub-systems to hash things out
REQ-TEL-001: All data is exported, but could be exported to Kafka
REQ-TEL-003:
KTL: This is under consideration and is working through the chain
Frossie: Does this prevent CSCs hard coding firmware versions
KTL: Will have to make sure that's part of the wording
REC-SW-2: Patrick, Tiago, Andy, and K-T should meet to hash out whether commanding configuration is in the plan
REQ-PTK-003:
Frossie: This seems a little scary
Jim: I don't think it's that bad except for setting up the right software
Tim: This is specifically running a part of the graph
Jim: We could provide some tooling to help do this
Tim: We have a requirement to do this because of the virtual data products
REQ-PTK-005:
Jim: If you replace URI with UUID, I think this is solved
DMTN-185 Post facto 2021-10-09
REQ-WFL-001: Done by Tim. Butler datasets.
REQ-WFL-002: Ops campaign management project. BPS configuration and logs will be made available by Michelle Butler. Any other workflow level (docker container version) information will be handled by the campaign management team.
REQ-WFL-003:
Tim: Campaign management need this
Jim: This is part of middleware
Tim: segv will not show up
Jim: Failed quanta and failed jobs are different. Former from middleware, latter from BPS logs.
Frossie: Do we have the tooling to surface this information through current tooling
Tim: Yes; PanDA knows about job failures.
Frossie: Tim owns making sure this information is surface-able
REQ-WFL-004: Panda pilot can surface CPU, memory, I/O info
REQ-WFL-005: Tim will make sure OS info is in base_packages (sp?). This should include host node info to the level possible. This may be via nodeId that means something unique to somebody
Frossie to add requirement for node ID inventory at the data centers
REC-FIL-001:
Gregory: The unique thing is the UUID
Tim: But this is not going into the header. It means all formatters need to know how to write metadata and all readers will need to know that there is (could be) a UUID that should be used.
Frossie: If I ship a user a dataset, they have to be able to tell me back what dataset I shipped them. Whether that is through UUIDs or some other mechanism, there needs to be a way
Tim: not all datasets know about metadata
Frossie: assuming all science datasets will have metadata is reasonable
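The scheme under discussion — formatters writing the dataset's UUID into its metadata so users can report back exactly what they were shipped — can be sketched in a few lines; the metadata key `LSST_UUID` is invented for illustration:

```python
import uuid

# Toy sketch: a formatter stamps the dataset's UUID into its metadata on
# write, and a reader reports it back so the origin can be identified.
# The LSST_UUID key is hypothetical, not an actual header keyword.

def write_dataset(pixels):
    dataset_id = uuid.uuid4()               # butler-style unique id
    return {"metadata": {"LSST_UUID": str(dataset_id)}, "data": pixels}

def identify(dataset):
    """Return the shipped dataset's UUID, or None for formats without metadata."""
    return dataset.get("metadata", {}).get("LSST_UUID")

ds = write_dataset([1, 2, 3])
assert identify(ds) == ds["metadata"]["LSST_UUID"]
assert identify({"data": [1, 2, 3]}) is None   # Tim's caveat: no metadata
```

The last line is Tim's caveat in miniature: any dataset type with no metadata slot falls outside this mechanism.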
REC-FIL-002:
Gregory will do the study in an ops capacity
REC-FIL-003:
Tim: This isn't a file level thing
Frossie: Propose to strike this on the basis that it is an understood objective
Robert G.: We can strike it, but this is more about tooling later
Frossie will move this req to another place
REC-SRC-001:
K-T will do the census of flags to make sure we can fit in 64 bits for sources and 128 bits for objects with buffer
REC-SRC-002:
K-T will look into data release ids fitting in 4 bits
REC-SRC-003:
With the above two K-T will look in general whether 64 bits is sufficient for source IDs
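The bit-budget questions above (flags within 64/128 bits, release IDs within 4 bits) amount to a packing exercise. A toy layout, with only the 4-bit release-ID width taken from the notes and the rest of the field layout invented:

```python
# Toy sketch of the source-ID bit budget: pack a 4-bit data release id
# together with a per-release counter into a single 64-bit integer.
# The 4-bit width is from the notes; the field layout is hypothetical.

RELEASE_BITS = 4
COUNTER_BITS = 64 - RELEASE_BITS

def pack_source_id(release, counter):
    assert 0 <= release < (1 << RELEASE_BITS)     # at most 16 releases
    assert 0 <= counter < (1 << COUNTER_BITS)
    return (release << COUNTER_BITS) | counter

def unpack_source_id(source_id):
    return source_id >> COUNTER_BITS, source_id & ((1 << COUNTER_BITS) - 1)

sid = pack_source_id(release=3, counter=123456789)
assert unpack_source_id(sid) == (3, 123456789)
assert sid < (1 << 64)                            # fits in 64 bits
```

The census K-T is doing is essentially checking that every field that must live inside the ID (or the flag words) fits within such a budget, with buffer to spare.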
REC-SRC-004:
Leanne will provide new language in the DPDD around footprints and heavy footprints and Gregory will collaborate
REC-MET-001:
Frossie will replace dataId with UUID and claim it
REC-MET-002 – Done
REC-MET-003:
Yusra will drive adding sufficient metadata to persisted Job objects that specific measurements can be looked up from the original butler repository from metadata in the Job. I.e. the repo root, run, collection, and dataId will all need to be knowable from the JSON persisted Job object.
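A sketch of what this asks for — a persisted Job whose JSON carries enough provenance to look the measurement up in the original butler repository. The field names and values below are illustrative, not the actual lsst.verify schema:

```python
import json

# Toy sketch of REC-MET-003: the persisted Job records repo root, run,
# collection, and dataId alongside its measurements, so each measurement
# is traceable back to the butler. All names/values are hypothetical.

job = {
    "measurements": [
        {"metric": "validate_drp.AM1", "value": 4.2, "unit": "marcsec"},
    ],
    "provenance": {
        "repo_root": "s3://bucket/repo",       # hypothetical repo location
        "run": "u/someone/w_2021_40",
        "collection": "HSC/runs/RC2",
        "dataId": {"instrument": "HSC", "visit": 903334, "detector": 16},
    },
}

blob = json.dumps(job)       # persists cleanly as JSON
restored = json.loads(blob)
assert restored["provenance"]["dataId"]["visit"] == 903334
```

The point of the structure is that nothing outside the JSON blob is needed to reconstruct where the measurement came from.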
REC-MET-004:
Yusra will describe how this is done currently with measurements not related to specific datasets like runtimes in jointcal and verify_ap
REC-MET-005:
Tim: There is no problem with having a special metric measurements backend to butler
Frossie will discuss with Yusra whether/how this will be pursued
REC-LOG-1:
Richard owns logging. Frossie will coordinate
REC-LOG-2:
Frossie will make sure log management solutions are in place for all sites
Impersonation or not? Inside K8s or outside? Integrated with DF systems or not? Could UWS be enough? Are we even ready to start discussing requirements or design? If not now, when?
Frossie: Lots of this is lots of work. Would it be the worst thing in the world to offer batch that requires running exactly like production (e.g. using PipelineTasks)?
Tim: If we put user auth in Panda, this is basically trivial. If we offer running arbitrary docker images, this gets way harder
KTL: Of course the standard HPC env is a shell prompt, not BPS
GPDF: I thought we would go just that route, e.g. batch submission from the command line. It's late to do something more sophisticated unless we bring in someone else's system
Frossie: CADC's model is different from ours, so we can't borrow from them
Richard: We are adding cores throughout the project. My suspicion is that most people won't do image processing, but will be doing random batch processing with results of queries
Eric: There is a steep learning curve with our pipelines code if we make them go that route
RHL: Colin's use case is the one I really want supported
GPDF: The community compute is meant to democratize access, not support large collaborations like DESC completely
Wil: I believe we have provided this via notebooks. People do want dask or spark, but we need a solution that is controllable
Frossie: We have always talked about there being a TAC that will manage access
Leanne: In ops this is called the User Committee
Wil: We may get away without having to have a lot of process around allocation depending on usage patterns
Frossie: It is probably best to be legalistic about requirements so that we don't get caught in the situation where we are providing "nice to haves" at the expense of delivering the system we promised
Leanne Guy will provide a reference-able document on interpretation of the user batch requirements that will define the minimum viable system we need to deliver. (Update: requirements will be presented at 2021-10-18 vF2F meeting)
Use OCPS or start building a more sophisticated execution system for USDF?
no detailed design for prompt processing - could use OCPS if we added an event trigger
RobertG worried about security (OGAs) not allowing this everywhere - baseline is USDF with secure links.
Worry about FARO publication to SQuaSH being slow - Frossie and Leanne agree this is a bug, probably with the SQuaSH API, and it will be solved.
LPG: faro writes out single scalar quantities, so there should not be any issues with storage.
GPDF reminds us that *originally* PP was going to be at the Base AND at the Archive/USDF.
RHL in favor of using OCPS for prompt - need access to SAL messages
Colin worries OCPS is not covering all the open issues - OCPS exists though and could be a step in the correct direction
Eric: if we moved to Chile, does the Cassandra Prompt DB also need to move? Yes.
Tim: wherever it runs, you need to reflect this in the OODS; other problems like graph generation will have to be solved. But it's not in the planning.
Jim: Gen 3 problems are not hard if you don't use quantum graph generation, but a bit of time needs to be scheduled.
RHL - if we generalize prompt production a little it will solve lots of problems currently in OCPS
Cristian: how much space at the summit? About one rack.
Frossie worries about running in Chile - Ops IT is unclear, among many other problems. Do the minimum on the summit and throw it away; separate alerts from the OCPS use case. RHL: to say we are only doing sanity checks is not correct; we need multi-step scatter-gather.
Richard: sounds like a workflow engine - is PanDA an option to run prompt processing? UWS can be interfaced to anything like PanDA.
Mostly between Tim and KT - WOM wants to stay involved to make sure PP does not get overcomplicated.
Colin: how does he gain confidence that he will have prompt processing?
Tim - once DP0.2 is done PP is the priority.
KTL will develop design document:
DM-30854
The exposure table is a key piece of observatory metadata, but I have been unable to determine who is in charge of constructing it, and its lack is starting to block work. Gregory Dubois-Felsmann or I can give a brief overview of the state of play, but we should identify a way forward.
YA: + what if any is the relationship with the pipeline-output CcdVisit and Visit Tables.
Yusra from Sci Pipes: Parquet for visits is implemented (covers exposure); some things from the EFD, like mirror positions, are not clear, and how to tie them in is not clear.
Tim: concern that visit and exposure are not the same; GPDF used both words separately with different meanings. Most exposure info can come from the EFD (GPDF: plausible). What is the path from the EFD to a new header (FITS), with each keyword in a table? GPDF says that exists. Need to get it back into Gen 3 - naming needs to be fixed and homogenized. The Gen 3 formatter needs to get the header from this system (per DR).
KT: lots of metadata is calculated at different times, up to a year later - so is it one thing or multiple things? GPDF: need a technical architecture for this - may need separate tables.
FE does not want to be pulled into this - there is an aggregating in-stream formatting capability in Kafka; there is a demo of this for weather data going to a relational table. This should fulfil the needs above but does not solve the data model? CloudSQL on IDF for Postgres.
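FE's Kafka demo — aggregating an in-stream feed of weather samples into relational-table rows — can be caricatured in plain Python; a list stands in for the topic, and the field names are invented:

```python
from statistics import mean

# Toy sketch of in-stream aggregation: roll a stream of weather samples
# up into one relational-table row per exposure. A plain list stands in
# for the Kafka topic; field names are hypothetical.

stream = [
    {"exposure": 100, "temp_C": 10.0},
    {"exposure": 100, "temp_C": 14.0},
    {"exposure": 101, "temp_C": 12.6},
]

def aggregate(samples):
    """One output row per exposure: mean of the samples seen for it."""
    rows = {}
    for s in samples:
        rows.setdefault(s["exposure"], []).append(s["temp_C"])
    return [{"exposure": e, "temp_C_mean": mean(v)} for e, v in sorted(rows.items())]

print(aggregate(stream))
```

As FE notes, this kind of roll-up answers the "get EFD values per exposure" need but leaves the data model (which tables, which columns) undecided.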
Tim: will the butler registry at USDF be kept up to date at low latency?
We can release pointings, but not pixels, faster than 24 hrs.
Yusra: how many tables? Should think of it partially as a data product output of pipelines.
Richard: plots per exposure or plots per multiple exposures? Tim: put them in the Butler Gen 3 repo.
KT: the other place is the LFA - but that's for other datasets we would have had in the butler.
Who is going to make this happen?
RHL would like to see it designed.
GPDF: there is substance in his two points - baseline those. (General agreement / no objections.)
Who takes the responsibility for moving this onward?
Tim: Session 1) Gen3 Q&A for developers to ask question of middleware. Session 2) Helping the community switch from Gen2 to Gen3.
Simon: Most users either know Gen3 from start, or have already started 2→3 transition.
Ian: Good for some of the Gen3 power users (non middleware devs) to lead something from a user perspective. Wil: So a tutorial? Simon: hard to know what issues people are going to have, if we have a tutorial we should also have a Q&A.
Wil: Q&A for DM developers. Then Tutorial session. Then slots for "come in and ask question", open to anyone, "this is what I'm trying to do".
Yusra: Not great attendance with help/tutorial sessions at prior PCWs.
Jim: How many people who would be helped by this are actually planning to attend PCW. Wil: DP0.1 users coming online, some fraction of that might want this? KT: PCW planning on community, can use that to gauge.
Ian Sullivan Discuss within Science Pipelines who should lead a Butler tutorial/QA session at PCW.
Tim: CET might already have good tutorials for Gen3.
KT: Review of how DM works w.r.t SIT-COM, urgent tickets.
Frossie Economou Prepare a "How DM works with SITCOM et al." presentation as part of the PCW DM All Hands session.
KT: PCW in Chile? Wil to discuss with Victor. Wil: Add slide to deck
RobertG: Docs on Gen3 is required for deprecation.
Gregory: DP0 "how it's going" session is canceled? Yes, we have many sessions with Delegates. Frossie will have a "Coffee with RSP Devs" session. Separate session for RSP Devs w/ other data centers.
Tim: concerned about duplication of effort between CET gen3 docs and DM gen3 docs. Tim and Leanne will resolve offline. Gregory similar concern. Wil, when does this link back up? After we get feedback from delegates, we'll know more about what was useful. Simon is working on updating the pipelines.lsst.io tutorial to gen3. Yusra: Task docs also exist, need refresh in the fall.
Frossie: Russ has a good tech talk on security, arrange with Cristian. Q&A on security.
09:45
Status I.
Team status and brief overview of EPICs to FY23 given to Kevin.
RHL: is the plan to have AP prototype processing running at SLAC by next summer? Yes. Eric: Hope that AP effort serves as a forcing function. Fritz: Need to have compute on the floor. There are ways to find compute.
Data Release Production
NCSA
Is NTS going to Chile or Tucson? ITTN-30 gives the test stand plan. (CTS: Couldn't understand the answer on this, someone else should supply)
Arch
Gregory: Status of RFC-775? Jim hasn't gotten to writing the implementation tickets, will then adopt.
Kian-Tat Lim Convene a meeting with Colin, Tim, Robert, Yusra to resolve graph generation with per-dataset quantities (likely based on Consolidated DB work).