Quick check-in on OR#2 for commissioning - we put this off for Gen3, the ComCam move to the summit, etc.
Now we have Gen3 repos -
ComCam is up on Floor 3.
Should we do something to ensure we are ready for the integration on Floor 3?
EFD, ComCam, hexapod, cable wrap, ...
Or a repeat of last time, but with Gen3?
We will do another Ops rehearsal. It will be Gen3-focused; schedule TBD. ComCam is now on the summit, so we have options. RHL: it will be good if we can make it work so that we learn the things we need to learn.
RHL: What do you mean by logs? FE: Everything. Not sure how to rendezvous camera logs with pipeline logs, e.g.
GPDF: These are process logs, not science log books? FE: Correct
KSK: We are not doing hardware provenance.
KTL: What about provenance of experimental hardware? GPDF: We will be linking to the maintenance management system (MMS). That should tell you what hardware is where and when, but we are assuming that system will be fully functional without input from this working group.
KTL: It's possible that the YAML description in obs_lsst could get out of sync with the on summit reality if these are not intentionally linked up
TJ: Detector serial numbers are directly encoded in the image headers, so no need to go back to the MMS for that. RHL: But raft serials are not
RHL: There are systems that may be dropped that the WG could raise as important sources of provenance that should go forward.
Gregory Dubois-Felsmann : Will help with word-smithing in the report to call out the need to preserve any camera metadata/telemetry that is not ending up in the EFD. Due by end of Provenance WG tenure.
FM: Sizing model? KTL: If there is a recommendation to get rid of heavy footprints, it will help, since they are currently in it.
FE: Source IDs are a concern. See the report with recommendations.
GPDF: Do we need to not worry about compressed PVIs? KTL: We still need to worry. We are assuming a compression factor for those.
RHL: When is the report due out? FE: By next DMLT F2F. RHL: I'd like to see a draft.
Frossie Economou Will circulate Provenance WG document when draft is ready.
Container builds across RSP, Prompt Processing, DRP, Telescope & Site
Non-service non-stack packaging
KTL - should we build T&S software with the stack? With the same tech, or at the same time?
Merlin would like the entangled part released daily.
It would be good to notice if something in T&S breaks when we update the stack - the problem is finding out later.
Tim notes in chat that PanDA is now using lsst_distrib containers.
RHL - we have painted ourselves into a corner - on the mountain we want to be able to do developer-type things; we need to get together and decide how to achieve that. Feels the Docker method is more aimed at stable releases.
Not entirely agreed - containers do not necessarily make it slower.
Conway's law - currently Science Platform, T&S software, etc. are all under one lead - but not in ops.
Pick a day in March to try to settle some of this; seed some tech notes.
JB - we also need to look at releases and patches, rather than using the bleeding edge.
This has been suggested; some people recoil from it.
The only way we currently test T&S and the Stack together is on the summit - we need some tests (notebooks can be driven by Nublado).
KTL: the intent was to use the NCSA test stand for this, but AuxTel was not running there.
RHL says it may be the Tucson test stand.
Colin - what's the rigidity of the T&S versioning situation?
Fundamental T&S issue with XML - an interface change requires everything to be rebuilt.
But it may be the way we are packaging this with the stack that is the problem.
TimJ points out Nublado is used for observing, which was not supposed to happen - Nublado was meant to be insulated from this.
Go back and create cleaner interfaces ..
KTL - T&S does not have a build engineer in ops - do we actually have anyone to produce any output?
Interfaces or Monolith ..
Some background given on the lack of a control system and why notebooks came into being.
GPDF - could use a cookie between the two containers - but this is the discussion we should have in the meeting.
Frossie Economou Doodle poll a day in March for a discussion on container builds with coordinated group DM and T&S ..
FM - is this agnostic to which things are colocated on site? (slide 5)
Assumes a single site - workflows should be single-site, but campaigns can span sites.
Frossie - is a campaign offline, or data drip?
It would include data drip.
GPDF - slide 4 - pipeline is not mentioned in the definitions; is a workflow a pipeline execution?
Some discussion on that; TimJ clarified (sorry, missed it).
GPDF: Marking intermediate products bad and keeping them out of downstream processing was a critical feature of the _BaBar_ workflow system.
JB - It is important - dataIds may not be enough then.
Exclusions may change for a new campaign.
It may be better to record it more explicitly per campaign.
TJ - not automated enough - there is no plan for using generated metrics downstream. The provenance records which inputs were used.
KTL: some up-front setup from humans - pipelines should take care of some of the rest.
FE - is there an interface that could be used by computers as well as humans? Do humans have to be involved in the nightly processing?
Yes and no - some human needs to configure the nightly processing, but no one should have to push a button.
Implementation is important - we do not want to encourage very manual intervention.
RHL - exclusions from looking at a flat are not different from QA systems flagging things - it's a continuum. We will find things in the middle and say we should have flagged this - can we go back and figure out which products are affected?
TJ - we absolutely need a way to know that if Flat X is suddenly determined to be bad then we can find all products that used X and redo them
If you change the campaign it's a new one - you don't have to run all of it again, of course (if Gen3 allows you to use partial workflows).
RHL wants to continue the processing with some new exclusion - not necessarily a new campaign; it may be semantics.
TJ Regarding provenance, you could imagine that we could rebuild a graph from the provenance of that product and just remake that product - JB agrees
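The graph-rebuild idea TJ describes can be pictured as a reverse traversal over provenance records. A minimal sketch, assuming provenance is available as a product-to-inputs mapping; the dataset names (`flat_X`, `calexp_123`, `coadd_7`) are invented for illustration and are not real Butler dataset refs:

```python
from collections import defaultdict, deque

# Hypothetical provenance records: each product lists the inputs it was made from.
provenance = {
    "calexp_123": ["raw_123", "flat_X"],
    "calexp_124": ["raw_124", "flat_Y"],
    "coadd_7": ["calexp_123", "calexp_124"],
}

def affected_by(bad_input, provenance):
    """Return every product transitively downstream of bad_input."""
    # Invert the product -> inputs mapping into input -> consuming products.
    consumers = defaultdict(set)
    for product, inputs in provenance.items():
        for inp in inputs:
            consumers[inp].add(product)
    # Breadth-first walk over the consumers of the bad input.
    affected, queue = set(), deque([bad_input])
    while queue:
        node = queue.popleft()
        for product in consumers[node]:
            if product not in affected:
                affected.add(product)
                queue.append(product)
    return affected

print(sorted(affected_by("flat_X", provenance)))
# -> ['calexp_123', 'coadd_7']
```

Everything returned could then be scheduled for reprocessing, which is the "just remake that product" case TJ and JB agree on.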
TJ - does this cover which data was processed?
Not in this system.
Holes in processing would be important, though.
KTL thinks perhaps this would be an add-on "linter".
GPDF - if a campaign is immutable and runs for 8 months, we need something which tracks it.
It could include campaigns.
I think I'd like semantic changing of versions to just be one case of semantic versioning of software releases, with all config changes recognized as essentially the same as a change to the software.
JB - the levels now defined work if we push them down to the level of a workflow.
Iterations add workflows in the middle of pre-existing workflows.
The progress tool needs to consider how the chunking was done.
RHL: this implies semantic versioning.
WOM: not all campaigns are equal - some we restart and interact with; some we must always start from the beginning.
FE - we do not want two ways to get the processing done, manual and automatic; if we do not put pressure on automation, it will not happen. We would not want to be manually processing data after 10 years.
KTL - trying to find clear places where automation can be added. Would like more pressure to get things automated.
There is not yet an assignment of who will build this - we cannot promise the scope.
So we want a framework that could be used.
FE - when we have made things simple before it has bitten us - we should define what we need and get a high-level developer to implement it; it is not something just anyone can pick up.
RHL - does not understand the worry about (lack of) automation - we will do small QA runs and we will inject; KT has defined mechanisms that allow that.
Frossie: once you provide tooling for humans it takes effort, and a better system could have been built to automate it.
SK: i.e., if the name of the person running the pipeline changes, the configs would evaluate as equivalent (true), but if the size of the aperture for aperture correction changes, they would evaluate as false.
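SK's point amounts to comparing configs while ignoring fields that cannot affect the science outputs. A toy sketch of that idea; the field names (`operator`, `aperture_radius`, etc.) are entirely hypothetical:

```python
# Fields assumed not to affect science outputs (hypothetical names).
SCIENCE_IRRELEVANT = {"operator", "run_timestamp", "log_level"}

def configs_equivalent(a, b, ignore=SCIENCE_IRRELEVANT):
    """True if two config dicts differ only in science-irrelevant fields."""
    keys = (set(a) | set(b)) - ignore
    return all(a.get(k) == b.get(k) for k in keys)

old = {"operator": "alice", "aperture_radius": 12.0, "log_level": "INFO"}
new_operator = {**old, "operator": "bob"}          # only the person changed
new_aperture = {**old, "aperture_radius": 17.0}    # a science parameter changed

print(configs_equivalent(old, new_operator))  # -> True
print(configs_equivalent(old, new_aperture))  # -> False
```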
TJ - problem: say "good seeing" - do we want the graph builder to understand it, or is a user making a query to get the list of raws? What about 10 million input files - calculating metrics which are used in downstream tasks; if it's not wanted, we do not need to build it.
JB: assume we were building this.
KT: an interface for a list of exclusions is only an option, not the main way - there should be a selector function built in.
RHL offers an HSC example in the deep fields - TJ does not think anyone is AGAINST exclusion lists.
JB: we already have a system for this which is not using an explicit list.
YA In chat pointed out the metrics may be selected on if stored as datasets - they do not have to be in the registry.
TJ - in favor of exclusion lists; there are multiple levels: some data is used only as part of the observatory, everything else should go through single-frame processing, and some you will never want in a coadd.
It's more complex than that - ideally it would always come from the metrics.
A different selector for each coadd (YA).
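One way to picture the selector-vs-exclusion-list discussion above: a selector is just a predicate over dataIds, and an explicit exclusion list is one such predicate among many. A minimal sketch, with invented dataId keys, visit numbers, and metric names:

```python
# Explicit exclusion list - just one selector among many (visit IDs invented).
EXCLUDED_VISITS = {1002}

# Per-visit metrics assumed to be available from some lookup (values invented).
metrics = {1001: {"seeing": 0.7}, 1002: {"seeing": 0.6}, 1003: {"seeing": 1.4}}

def not_excluded(data_id):
    return data_id["visit"] not in EXCLUDED_VISITS

def good_seeing(data_id, limit=1.0):
    return metrics[data_id["visit"]]["seeing"] < limit

def select(data_ids, selectors):
    """Keep dataIds that pass every selector predicate."""
    return [d for d in data_ids if all(s(d) for s in selectors)]

data_ids = [{"visit": v} for v in (1001, 1002, 1003)]
print(select(data_ids, [not_excluded, good_seeing]))
# -> [{'visit': 1001}]
```

A per-coadd selector, as YA suggests, is then just a different list of predicates for each coadd.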
CS - easier to think about how it all interacts when we have a design for a straw-man system - in particular the mapping of a campaign to the workflows below it.
A list of BPS commands put in a form.
How you come up with that list is another intellectual/science problem.
But that is exactly the part that needs to be worked out.
RHL suggests doing this concretely for HSC.
TODO
Fleshing out the external tools (slide 7) seems like it would be useful:
- tooling to generate the BPS lists and exclusion lists
- using provenance from a previous campaign to come up with a new one
DM is not scoped to provide support in construction. Now that we have CET funded in pre-ops and they are building a model for community engagement in operations, I'd like to hand the management and evolution of the community platform to the OPS-CET
FE: can we use "deliver to operations" not "handover to operations"
DM/DP will continue supporting the service, with the emphasis that it is for in-project use as well as a community service.
The CET has authority to design the front page, assign moderators, etc.; we need to maintain private groups/private topics as an internal communication tool for DM.
We could write this division of responsibilities down in a tech note
RHL: It is hard to transmit project knowledge to the CET. Support channels are said to be essential, but are not sustainable.
MLG: continued participation of DM expertise is essential
RHL: How does the DM side of support scale?
WOM: That is an operations issue, and not a DM issue
CET is meant to be the curator and first line of defense against science questions.
That means DM/DP does not have to monitor Community, but CET may call on specific experts to answer hard questions when they come up.
TJ: Isn't the exposure table just a table with all the FITS headers, and not a derived quantity? GPDF: I don't remember that being the case. KT: Confirms. The DPDD doesn't even have a schema for the exposure table. GPDF: I'm more worried about the visit table, which has metrics.
TJ: Could we do a join of all the SQuaSH metrics and a link to the exposures? GPDF: Yes, and can we map that onto CAOM2? We previously acknowledged that CAOM2 was awkward to use in production, but I think we can still use it afterward. This is lookback. TJ: Ah, it's not the concept of a visit, but the actual post-processed visit.
FE: Big, big fan of cutting down data models. There are way too many now, and it's hard to deal with them. I would love to go to the two that Gregory has suggested. KT: As long as there are two and not more than that... dynamic observing.
Slide 9: Metadata creation and loading workflows. DRPs. K-T: This also relates back to campaign management activities. GPDF: If we can extract Gen3 into the ObsCore data model, we can do that on the fly and get a respectable image browser. There's tooling that can use that.
Slide 10: Nightly processing. WO: It is looking like we will be asked to hold for > 6 hr. CS: The metadata is useful for interpreting the alert stream itself. It'd be weird if the metadata for an alert weren't available until 24 hours later. GPDF: Yes, we should at least be able to record that we HAVE TAKEN an observation. People would be able to deduce that from the alerts anyway. KT: Some metadata is OK to release. The fact that we took a picture, at what airmass or seeing, is not a problem. WO: People are ONLY worried about the pixels. Everything else is OK. KT: So now we're back to < 6 hr release of the metadata. FE: It's frustrating, but we understand what they're worried about. We should get ahead of it and say: here's what we want to do without releasing the pixels. What makes me nervous is that anyone can call Wil and tell him they have a draconian solution to this problem that he has to implement.
GPDF: I know Frossie has opinions on this. FE: Having a mode where we can serve static files (avoiding computation which is required by the baseline image service design) KT: if everyone's looking at the same asteroid, we can cache it. There was also that statement that "If people are using the pixels and not the catalogs, then we failed." Notetaker's editorial: boo WO: That's basically an FTP server. FE: noooooooo
GPDF (running out of time): Frossie, take a look at the TAP slide.
KT: sometimes, I found string reps of numbers compress better than binary reps. It ends up being more verbose but compressing better. AS: In this case, it's not numbers that are the problem. KT: Sounds not worth investigating.
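KT's observation is easy to measure rather than argue about. A neutral harness using only the standard library; the sample values are made up, and which representation compresses better depends entirely on the data:

```python
import struct
import zlib

# Invented sample: low-precision decimals, the kind of case where text can
# compress surprisingly well compared to raw IEEE doubles.
values = [round(0.1 * i, 1) for i in range(10000)]

as_text = ",".join(str(v) for v in values).encode("ascii")
as_binary = struct.pack(f"{len(values)}d", *values)  # raw 8-byte doubles

text_compressed = len(zlib.compress(as_text))
binary_compressed = len(zlib.compress(as_binary))
print("text:  ", len(as_text), "->", text_compressed)
print("binary:", len(as_binary), "->", binary_compressed)
```

Running this on representative data would settle whether the effect matters for the case AS raises, where the problem is not the numbers.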
YA: Reconciling this with what you said earlier about pandas being faster than afwTable: converting to a pandas DataFrame is slow, but converting to afwTable is slower? AS: Yep. EB: The AP pipeline uses pandas and I'm not comfortable committing to refactoring at this point. AS: You can save money by having a smaller cluster. CS: It matters where you're doing it, because it determines what you build into the AP cluster vs. the APDB cluster. It's not fair to include the timing of the client when measuring the scaling of the DB server cluster. FM: I just want to bring to your attention that this conversion takes time, and you should be aware that it's there. [Notetaker's aside: there is lower-hanging fruit on the client side]
IS: It would be nice to see scaling with a factor 10x higher source density for the performance of outliers. AS: I'm worried about averages. The total number in the database is what's important.
FM: Cassandra has been holding up as a horizontal scaling strategy. On cost, cloud vs. vendor: 1-2 years is the break-even point. The good thing is both are fine; the bad thing is it's a tough decision. MB: Make sure you include the test systems too! Also, can you grow this, or do you need 12 right away? FM: It'll only save us half a year at most. But I'm open to your advice.
AS: The code is on a separate branch that has diverged significantly. EB: We took some action with Michelle and co. and are using Postgres at NCSA. It's working out. We're working on Gen3 migration issues and schema changes that we can put in. We have a way forward that's functional. It wasn't obvious that even during commissioning we would hit the scale where we need Cassandra. How and when should we test this? Maybe we can separate it from the go-on-sky timeline.
FM: It sounds like you're OK with your Postgres solution at NCSA. We'll evolve that with the Cassandra API over the next few months. Do you think Postgres will get you through commissioning? EB: It depends on how long commissioning runs, but we have to check with the commissioning team.
Wil: How big is it? 25-30 TB per year? 300 TB total. KT: Archive it. Google has a way to snapshot your disks. Oh, not Google? OK, well, still archive it.
- Was the bug detected with the fake-source injection pipeline an aperture correction bug? Yes.
- How well are things working in the DECam HiTS and bulge fields? There is a technote for the bulge data. Differencing artifacts remain. Please ask Ian for more numbers.
* Princeton
- Please leave Jim alone on Focus Friday
- What is the limiting factor for the 3 iterations of HSC RC2? Waiting time is less for Gen3. Having more reruns would help "slightly".
- Is there any way to get Leanne's group to look at Gen3 outputs? Triage process requires a lot of experience from Lauren.
- Tucson teststand PDUs delayed till April -- RHL in chat
- No change in development, NCSA still working with Tony on main camera -- Wil in response to slide
* Architecture
- Build engineer ad out, please share -- Wil
* DAX
- No questions
* SQuaRE
- Where are we with the Science Platform landing page? squareone is under early development, to be released next half-cycle.
* Science
- Has Focus Friday really impacted Stack Club? [Resumption of earlier discussion about better cover for Stack Club by going back to assigning people to mind it - Wil will discuss slide phrasing]
* Wrapup
- Thanks to all for a good meeting
- Let's use Slack instead of Zoom chat for side discussions next time [but then we can get distracted -- KT]