Session 2:

Confirmed Attendees:

Merlin Fisher-Levine, Kian-Tat Lim, Yusra AlSayyad, Frossie Economou, Colin Slater, Fritz Mueller, Jim Bosch, Hsin-Fang Chiang, Wil O'Mullane

Invited (please move your name to confirmed ^^^ if you can make it):

 Russ Allbery 

Objectives:

The final 5 minutes from last time. 

  • WO: Reflecting on Rapid/Prompt: considering they both work per-CCD, they seem more similar than different. Is the real difference that Prompt loads some specific sky data? Otherwise, if we fixed the CCD assignment for Prompt, it would load similar data to Rapid.
  • FE: Merlin, what do you think about Jim's question? Jim says custom PipelineTask executors can do anything. I'd be interested in looking into whether we could make the channels PipelineTasks with custom executors watching and doing I/O.
  • MFL: We could, but why? Wouldn't you suffer from slow startups again?
  • KTL: With a custom activator you can do a lot
  • KTL's where do we go from here:
    • "Summit Services Team" seems appropriate
    • Rapid Analysis is a critical DM customer because it supports observing, and tickets (e.g. from a bug in DM code) may deserve the urgent flag
    • framework could benefit from DM input, design, and deployment (now and even after summit services exist)
    • DM could help improve testing now. 
  • Frossie's where do we go from here:

    • Technote to join RA with OCPS and also to merge the auto-ingest with RA
    • Survive Commissioning
    • Get back to some of the ideas for consolidating RA with Prompt, in case we decide they have merit
      Sidecar: Who looks after this. Not convinced MAR ("Michael Reuter" a.k.a Summit Services) group is the right answer for this one
      Sidecar: Duchies
  • FE: Rapid Analysis wasn't what I had in mind for the summit services team.
  • WO: We discussed code that was developed by DM but deployed at the summit
  • KTL: Who owns the framework is what is important.
  • FE: some of the processing frameworks are in Tim's duchy. If this ends up in DM then that person should be looped in. 


================================================================================================


Questions to discuss today. OK. I guess the answer is ALL of them:

  • Who owns the framework? Is "Summit Services Team" appropriate?
    • Sidecar: Who looks after this. Not convinced MAR ("Michael Reuter" a.k.a Summit Services) group is the right answer for this one
    • We can't declare at the end of the hour that it is either Tim or Michael, but we can define some homework problems. We're not here to review past decisions, but to decide what is best for the next steps. What are the next steps?
    • FE: I propose that, before we talk to Michael (and/or Tim), we first write the technote, because it would be more palatable to replace something I'm supporting with supporting something else. KTL: though RA is more complex than OCPS.
      • Technote to join RA with OCPS and also to merge the auto-ingest with RA.  
    • CTS: How about a technote on consolidating RA and Prompt? Even if there's no effort in the short term, knowing what the technical situation is that prevents their merging would be useful now.
      • KTL: I started https://dmtn-255.lsst.io/v/u-ktl-initial-draft/index.html
      • CTS: I would like a "what would it take to accomplish both goals with the same framework."
      • Maybe we'll say "too much work," "OK, but later…," or "nah."
      • JFB: You could merge them either by relaxing demands ("you can't have this functionality") or by adding modes. CTS: It'll probably be a mix.
    • CTS: It's important to include middleware in the scope of this. There's a situation where different middleware code would let you do processing within the speed requirements that RA is asking for, with a different framework. JFB: i.e., if someone had felt empowered to make changes in the middleware code, they may have done it differently.
    • KTL: Let's add a section that addresses this.
    • KTL: Could we ever have different frameworks that all run ISR, can we say that now? 
    • HF: I don't want to merge PP and RA now. YA: And I don't want to block acknowledging RA on the plan for merging PP and RA. FE: Yes, let's block it on the merging with OCPS, but not PP. CTS: The PP/RA merger could lessen the load long term; the load would be imposed on DM by adding it as something that we need to maintain. MFL: I am convinced that Rapid Analysis COULD absorb auto-ingest and OCPS. But practically, no one has done any development on OCPS/auto-ingest in forever; the maintenance of those could be negligible. KTL: Indeed, OCPS has been in a state of benevolent neglect for a long time. FE: But what if a Python version change breaks it at the worst time? CTS: "Merging OCPS+RA" could mean either 1) make RA commandable by SAL, or 2) migrate the payloads to RA without making RA commandable. The technote is needed to resolve that question.
  • JFB request to talk about triggers and elasticity while he's here. 
    • Would it satisfy OCPS use cases if that became data-driven? KTL: OCPS now: a script collects biases and executes an OCPS command to process those biases. Then it takes a bunch of darks, then executes a command to process those darks (which use the biases as a calibration data product). The script can wait for a collection of images; it also can wait for a single image to show up. WO: Robert argued that it was impossible to automate what to do with the various images. FE: You can put flags in the header that tell the processor what to do. The assertion that you can't have a data-driven pipeline because only the person knows what data they're taking and what they want to do with it is <argh>... KTL: The other aspect is returning a response. JFB: If you're dumping something in the Redis database, you could look for something in the Redis database. OCPS isn't latency-critical, but it is possible that we'll have to take calibs in the middle of the night. CTS: Getting the response back seems like the part we can give up. All: agreed.
    • KTL: On the auto-ingest side, the triggers are similar. It uses the number of detector nodes; the operator can expand it. This would merge right in with Merlin's workers: the prep is a subset of the prep the workers are already doing, re-establishing a butler connection. The <?> is where all the writing will go.
    • There are ways we could split prompt processing up into a next visit trigger and an image arrival trigger that relies on fast shared storage. 
    • note taker taking a bio break
  • Rapid Analysis is a critical DM customer because it supports observing, and tickets (e.g. from a bug in DM code) may deserve the urgent flag
  • framework could benefit from DM input, design, and deployment (now and even after summit services exist)
  • DM could help improve testing now.
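
FE's "flags in the header" suggestion in the triggers discussion above could be sketched roughly as below. This is a hypothetical illustration (all names — `RA_ACTION`, `process_bias`, the obs_ids — are invented, not from any real RA or OCPS code): the acquisition script tags each exposure with what should happen to it, and the processor dispatches on that tag as images arrive, instead of waiting for an explicit OCPS command.

```python
# Hypothetical sketch of the header-flag idea: each exposure carries a tag
# saying what processing it wants; the processor routes on that tag.
# All names here are invented for illustration.

def process_bias(header):
    return f"stacked bias from {header['obs_id']}"

def process_dark(header):
    # In the real flow, darks would use the bias stack as a calib input.
    return f"calibrated dark from {header['obs_id']}"

HANDLERS = {
    "BIAS": process_bias,
    "DARK": process_dark,
}

def dispatch(header):
    """Route one arriving image based on its header flag."""
    flag = header.get("RA_ACTION", "NONE")
    handler = HANDLERS.get(flag)
    if handler is None:
        return None  # image not tagged for processing; ignore it
    return handler(header)

results = [
    dispatch({"obs_id": "AT_001", "RA_ACTION": "BIAS"}),
    dispatch({"obs_id": "AT_002", "RA_ACTION": "DARK"}),
    dispatch({"obs_id": "AT_003"}),  # untagged: skipped
]
```

As KTL notes, this leaves open the question of returning a response to the commanding script, which the room agreed is the part that could be given up.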

WRAP UP:

  • Request for testing payloads in Jenkins. Making the payloads PipelineTasks would make this easy.
    • Yusra will schedule something.
  • Technote for what it will take to merge OCPS/Auto-ingest and RA. K-T 
  • Extend https://dmtn-255.lsst.io/v/u-ktl-initial-draft/index.html to include what it will take in a technical sense to merge them. The analysis will be useful to guide microdecisions between now and if/when we do the merge, but it will not block adoption of RA. Plan is to get through commissioning first.
    • Rapid Analysis is a critical DM customer because it supports observing, and tickets (e.g. from a bug in DM code) may deserve the urgent flag (WO: Yes)
  • Side note: FE to get a person to build a testing framework that tests frameworks.
  • Whether we recommend Michael (outside DM) or Tim (inside DM) to be determined after technotes. 


Session 1:

Confirmed Attendees:

Merlin Fisher-Levine, Kian-Tat Lim, Yusra AlSayyad, Frossie Economou, Jim Bosch, Hsin-Fang Chiang, Colin Slater, Wil O'Mullane, Fritz Mueller, Russ Allbery


Objectives:

This review is limited to Rapid Analysis ONLY. (Yes, I know there is a lot of other summit software the DM folks look after. Another time.)

Answering the questions:

  • What DM function does Rapid Analysis fulfill that cannot be done by any other DM software?
  • How does Rapid Analysis's architecture fulfill this function?
  • Where do we go from here?
    • Get time from people working on Rapid Analysis on what they think it needs, so we can help them get it done.


Request from Frossie: a walkthrough of “this is what this does, and we wrote it for this reason” so we can collectively go “but wait, can’t we use X for just this one thing that is already supported by another team” and maybe the answer is yes and maybe the answer is no, and maybe the answer is “sure if only this other thing was fast enough / in python / whatever”. And then we tally up any Xes and see if we can thin down the problem and make sure whatever is left (or all of it, if the first phase is unfruitful) is resourced and clearly belongs to someone.

YA: I interpreted this as a loop through the various payloads that are run on rapid analysis. 


Frossie: 

  • Is this sound? Where does this fit in the context of the other processing frameworks? Are those sound?

Slides: https://docs.google.com/presentation/d/1QqXW2HRJO_LAFb0SRLYA92Luj1ljrgZW_2GFSAxrH3s/edit?usp=sharing

Homework:

Here is a draft technote on Rapid Analysis. If you can read it ahead of time, it will help us make the best use of our time:

https://sitcomtn-100.lsst.io/v/DM-42117/index.html



Notes:

FE: Can we talk about latency at some point. The technote mentioned it a few times, and I think some of the latencies are overstated. 

FE: Need a breakout on spin-up/processing. 

FE: Tradeoffs between falling behind and keeping up by sampling. KTL: And you'd want to change the pattern.


Questions/comments from Jim

Processing control: do you really need all of this flexibility for different patterns?  Would reducing this flexibility help with maintenance much?

  • MFL: Flexibility - I don't actually know how much we need it, and tbh, I just made it up because it sounded like a good idea (but that's true of basically everything). Some of it is essential, I think, but the fancy patterns probably less so? However, it adds minimal complexity, so I don't think it's really a maintenance issue at all; it just sits in the fanout service. I can show you the code, but it's pretty trivial.
  • KTL: Does "patterns" here mean different subsets of the focal plane or different patterns of execution of payloads in general?  The latter is a bit less trivial, I would imagine.
  • MFL: From Jim: yes, this meant CCD patterns, and I think we agree it's all fine as it's a) simple and b) already has tests!
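
Since "patterns" here means CCD subsets of the focal plane, the fanout logic really can be as trivial as MFL suggests. A minimal sketch (the pattern names, the toy 9-detector focal plane, and `fan_out` are all invented for illustration, not the actual RA code):

```python
# Minimal sketch of CCD-pattern fanout: a "pattern" is just a rule that
# selects which detectors of the focal plane get work for a given image.
# Patterns, names, and the toy focal-plane size are invented.

NUM_DETECTORS = 9  # toy focal plane

PATTERNS = {
    "all": lambda det, seq: True,
    "checkerboard": lambda det, seq: det % 2 == seq % 2,  # alternates per image
    "corners": lambda det, seq: det in (0, 2, 6, 8),
}

def fan_out(pattern_name, seq_num):
    """Return the detector ids to dispatch for image number seq_num."""
    keep = PATTERNS[pattern_name]
    return [d for d in range(NUM_DETECTORS) if keep(d, seq_num)]
```

For example, `fan_out("checkerboard", 0)` selects the even detectors and `fan_out("checkerboard", 1)` the odd ones, which is the kind of alternating sampling the latency discussion above alludes to.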

Could "run an arbitrary-ish Pipeline against a Butler" be something Rapid Analysis *also* does, without any expectation that it would do everything that way?

  • MFL: I'm not sure I understand the question, sorry. But in case it helps: some pipelines are run with a butler, some without. All the SFM stuff is with a butler, though.
  • KTL: I think it's trivial for a RA payload to execute pipetask (or activate it via API).
  • MFL in conversation with Jim: this is about adding something generic, in order to make merging with OCPS (or something like that) easier. And MFL agrees that this would be easy.
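
KTL's point that an RA payload could trivially execute pipetask might look like the sketch below. The repo path, pipeline file, and collection names are placeholders; the `pipetask run` options shown (`-b`, `-p`, `-i`, `-o`, `-d`) are the standard butler-repo, pipeline, input/output collection, and data-query flags, but treat the exact invocation as an assumption to check against the stack version in use.

```python
# Sketch of an RA payload shelling out to `pipetask run`. Building the
# argv in a separate function keeps it testable without a stack install.
# Repo, pipeline, and collection names below are placeholders.
import subprocess

def pipetask_cmd(repo, pipeline, input_colls, output_coll, data_query):
    return [
        "pipetask", "run",
        "-b", repo,                   # butler repo / config
        "-p", pipeline,               # pipeline definition YAML
        "-i", ",".join(input_colls),  # input collections
        "-o", output_coll,            # output collection
        "-d", data_query,             # data-ID query selecting the exposure
    ]

def run_payload(cmd):
    # In the real payload this would run inside the worker pod.
    return subprocess.run(cmd, check=True)

cmd = pipetask_cmd(
    "/repo/embargo", "sfm.yaml",
    ["LATISS/raw/all", "LATISS/calib"],
    "u/ra/demo", "exposure = 2024032100123",
)
```

The alternative KTL mentions — activating it via API rather than the CLI — avoids the subprocess startup cost but couples the payload to middleware internals.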

Would being able to configure Rapid Analysis payloads dynamically via some other system (LOVE?) address the use cases that OCPS satisfies without actually needing SAL triggers?

  • MFL: Quite possibly, and yes, LOVE is certainly a good shout for that, and LOVE can communicate with RA config via redis already.
  • KTL: I don't think it's that easy.  The idea was to have a script that said "take these exposures" and then said "process exactly these exposures" and then maybe do something based on the results.  Having this be a data-driven "mode" begs the question of how you know when the group of exposures is done and how you know when the processing is finished.
  • MFL: Could OCS write to S3 to announce what it's taking?
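
One possible answer to KTL's "how do you know when the group of exposures is done" — assuming, per MFL's suggestion, that the script or OCS announces the expected group size up front — is a small completion tracker. This is purely an invented sketch (class and field names are hypothetical):

```python
# Hypothetical sketch: if the acquisition side announces how many
# exposures a group will contain, a data-driven mode can detect group
# completion on its own. All names invented.

class GroupTracker:
    def __init__(self):
        self.expected = {}   # group_id -> announced exposure count
        self.arrived = {}    # group_id -> set of exposure ids seen

    def announce(self, group_id, n_exposures):
        """Called when the script/OCS declares an upcoming group."""
        self.expected[group_id] = n_exposures

    def ingest(self, group_id, exp_id):
        """Record an arrival; return True when the group is complete."""
        got = self.arrived.setdefault(group_id, set())
        got.add(exp_id)
        return group_id in self.expected and len(got) == self.expected[group_id]

tracker = GroupTracker()
tracker.announce("bias-2024-03-21", 3)
done = [tracker.ingest("bias-2024-03-21", i) for i in (1, 2, 3)]
```

The harder half of KTL's objection — knowing when the *processing* of the group has finished and getting a response back to the script — is the part the group later agreed could be given up.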

Is it conceivable to drop re-gather support to reduce complexity and maintenance?  If not, is it conceivable to move it out of Rapid Analysis into some other system, at the cost of making the latency much higher?

  • MFL: Drop - I don't think so, it's a big issue, and I don't know of a better solution. Move - possibly, but 1) when things are really patchy you don't want it to be too far behind, 2) I'm not sure you gain much by moving it, 3) it's possible that that might make it worse in complexity because then a different framework has to work out what to do and when based on what RA's done. It's quite possible we just need to talk with words here though.

Custom PipelineTask executors can basically do anything (including not actually use a Butler).  I would be interested in looking into whether we could actually make most of the various channels into PipelineTasks *if* we had a custom executor to do the watching and I/O.  But maybe we should think first about what problems it might solve (perhaps testing of payloads in other contexts, perhaps reducing the amount of bespoke orchestration code that isn't tested elsewhere, since many aspects of PipelineTask execution are tested elsewhere).

  • MFL: Sounds very interesting - let's talk (smile)

Where is the lack of test coverage felt most acutely?

  • Snowflake payloads?
  • Stuff that runs payloads within pods?
  • Orchestration between pods?


  • MFL: Lack of CI makes refactoring type work hard, so it's a little of everything tbh.

What makes testing hard?

  • Hard to assemble test data to feed to payloads?
  • It's testing the orchestration that's hard?
  • We know how we'd do it, we just don't have the time.


  • MFL: it's partly that it's never been a priority (not that it's not important, but, well, always needing to add new stuff urgently), partly that I don't know where and how they'd run, and yes, very much assembling the input data, and because it's data-driven in many places, some of it needs a test service that sort of adds the data in realtime and checks the processes do what they were meant to, which is hard.

Frossie's where do we go from here:

  • Technote to join RA with OCPS and also to merge the auto-ingest with RA
  • Survive Commissioning
  • Get back to some of the ideas for consolidating RA with Prompt, in case we decide they have merit
    Sidecar: Who looks after this. Not convinced MAR ("Michael Reuter" a.k.a Summit Services) group is the right answer for this one
    Sidecar: Duchies

Next steps: shorter debrief meeting soon

