Victor will want a plan to the end of construction
Identify parts due to COVID and parts that are not
Look at burn rates — how far do we get with the money we have?
Assume we need to go to end of FY23 (Oct 2023)
Frossie: might as well show sheets if we have them
Transition of personnel to Operations needs to be checked
Can adjust ramps a bit, but delaying people doesn't get any more money into pre-Ops
Leanne: Zeljko mentioned that DM was funded through Construction, so is there extra money when people go to Ops? Answer: there was always a ramp, so no extra money
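The burn-rate question above (how far the remaining money goes, against a target of end of FY23) reduces to simple arithmetic. A minimal sketch in Python follows; the function names and every figure are hypothetical placeholders, not project numbers or values from the spreadsheets.

```python
def months_of_runway(remaining_budget: float, monthly_burn: float) -> float:
    """Return how many months the remaining budget lasts at a constant burn rate."""
    if monthly_burn <= 0:
        raise ValueError("monthly_burn must be positive")
    return remaining_budget / monthly_burn


def shortfall(remaining_budget: float, monthly_burn: float, months_needed: int) -> float:
    """Return the extra money needed to cover months_needed (0.0 if none)."""
    needed = monthly_burn * months_needed
    return max(0.0, needed - remaining_budget)


# Hypothetical figures, for illustration only.
budget = 1200.0  # remaining budget, arbitrary units
burn = 50.0      # spend per month, same units

print(months_of_runway(budget, burn))
print(shortfall(budget, burn, months_needed=30))
```

A constant burn rate is itself an assumption; a ramp (as discussed for the transition to Operations) would need a month-by-month sum instead.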
Do remaining milestones make sense? Do we have effort to achieve them?
When are things operational?
Baseline: When all 1b requirements are verified
DRP, reliability would then be pushed to Ops
Want to declare pieces operational earlier; they would still be in maintenance
Frossie: Handing over to Ops doesn't mean cessation of development
But pre-Ops money is problematic to use for development
Leanne working with Jeff on tying milestones to requirements; currently in spreadsheet, then will generate test plans and add more milestones if needed
DMTN-158 could have a list of requirements in the YAML file for each milestone
GPDF: A lot of milestones will have many requirements; test plans may be better places for requirements?
But need to be able to tell people when requirements will be met
Leanne: looking to extract from Jira so we have one source of truth
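One way to picture the idea discussed above (requirements listed per milestone, so that people can be told when a requirement will be met) is a mapping that is inverted per requirement. The sketch below uses plain Python dictionaries. The milestone keys, requirement IDs, and dates are invented placeholders, and it does not reflect the actual DMTN-158 YAML schema or any Jira export format.

```python
from datetime import date

# Hypothetical milestone records: due date plus the requirements each one verifies.
milestones = {
    "MILESTONE-A": {"due": date(2021, 6, 30), "requirements": ["REQ-1", "REQ-2"]},
    "MILESTONE-B": {"due": date(2022, 3, 31), "requirements": ["REQ-2", "REQ-3"]},
}


def requirement_dates(milestones: dict) -> dict:
    """Map each requirement to the latest due date among milestones that list it.

    Assumption: a requirement counts as met only once every milestone
    that lists it is complete, hence the latest date.
    """
    met_by = {}
    for record in milestones.values():
        for req in record["requirements"]:
            if req not in met_by or record["due"] > met_by[req]:
                met_by[req] = record["due"]
    return met_by


for req, due in sorted(requirement_dates(milestones).items()):
    print(req, due.isoformat())
```

Keeping the mapping in one place and deriving the per-requirement view from it fits the "one source of truth" goal, whichever system ends up holding the data.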
Justifying COVID expense:
Delayed prerequisite milestones
E.g. where we've been delayed due to not getting data
Variances in P6
Frossie: standing army waiting for things is not included?
EVM says we should deliver DM in Oct 2022 on budget
Presented spreadsheets showing burn rates and needs through end of FY23
RHL: How does proposed validation from algorithm candidates map to phase 3 when we do validation?
LG: I expect the authors of algorithms to be involved. If they don't provide validation, those algorithms will probably not be selected.
RHL: Retraining is also the responsibility of the authors.
JFB: +1
CS: Nothing said to the proposers gets us off the hook for anything. We can say that they have to validate and train, but we're still responsible if they don't.
KT: Do you see letters of rec from people other than their authors?
LG: It's possible USERS of algorithms will recommend.
JFB: Would "statement of interest" be more consistent?
LG: That sounds vague too!
RHL: If they don't do more than write a letter, then it won't be useful.
CS: We're not being explicit here about the project's responsibility.
RL: We never said we were going to deliver photo-zs. If no one in the community steps up, we'll have to put 0.1 for everything!
JFB: I agree with Robert that if no one does, this should be the first thing to get descoped. It's the thing that the community is better at. Operations.
YA: The backup plan isn't as scary as 0.1s. One of the first projects that DESC's new pipeline scientists have started is a photo-z estimator (Schmidt, Malz, Charles et al.). I bet they'll have a sufficient backup going.
CS: In that case, we should move the timeline earlier.
JFB: Agree. Clarifying the lines between the two groups early is good. We should get something written down so we don't get in an "I thought you were going to do that!" situation.
KT: If an author team proposes a photoz and we don't select it, is it still an in-kind contribution?
We are in a transitional period. Gen3 was just released to Science Pipelines, and we are awaiting feedback from users before deciding where to focus new development. In the meantime, potential discussion topics are:
Jim Bosch's proposal for repo organization of precursor data (RFC-741). Resolving open questions about proposed changes to filesystem locations and access controls should probably take precedence at DMLT.
Tim Jenness's expectations on pipelines use and feedback.
Yusra AlSayyad's plans for pipeline conversion and expansion?
Test Plan Discussion:
RG: We don't have a test plan: part my fault, part testing framework. Some tests are running now. Monika is running RC2. There was a request that she run all tracts together, but if you give it multiple things to run, the batch system will behave the same terrible way that the DES had: If one exposure has a problem, the job halts, and you have to work out what happened and restart.
Test plans: Where do you balance modularity in the test plan with just getting it done and being done with it? "I can ingest a comcam exposure", "I can ingest an auxtel exposure", etc. Do I write a test plan for each one? In the ops rehearsal, I wrote out each step of the test serially. You can't rerun it ever again.
WO: Re mechanics of testing, worth having a chat with Jeff Carlin and Leanne. This might be one-off, though.
RG: It'd be worth writing down: this is how you ingest a raw comcam exposure. "Go get a raw exp, and do an ingest. check No/Yes"
This is not the right way to do this. I'm authoring it. Executing it, and arguing why what I did was fine.
GPDF: What RG says about "self dealing" has been the exact same for the science platform. It'd be nice if we had someone with an independent point of view.
TJ: I was expecting this to be more collaborative with Science Pipelines.
RG: No one else has time for that either.
This is more important for telling external users that it is fit for use.
RHL: We can help with my integration work. The "does it ingest" is coming from the outside.
JFB: One prob is we wrote a Big bang milestone for a gradual process. It's inching ahead. We just now declared that the schema is stable, and it is worth switching dev to daily work. The important part of the milestone is schema stability, and the rest is the box-ticking exercise.
RG: This is one step more formal than the boxes on the confluence page. I'm not saying that I shouldn't be doing what I'm doing.
TJ: We can declare that we're adopting DM-DAX-12 whenever we want. If it's a 503 we can't.
All. Be prepared to provide feedback on whether test plan (LVV-P77) provides the tests and rigor necessary to declare Gen3 open for DM use/development.
TJ: When we mean pain points, we mean command lines being weird. That means usability. happy to make improvements
RHL: Where do we stand on remote butler access or butler exports? TJ: My client/server work includes this. You can do a local ingest into a local sqlite and the URIs are a remote archive that downloads on demand. Have to make that all work for RSP support. FE: We don't give people infrastructure accounts, everyone is a user of the services. WO: I support not using user accounts inside the DB. Yes, we did this with skyserver, and CADC does it. you need to know who the user is, but you don't need database accounts
CS: Are there other worries you haven't enumerated here re PP? TJ: We have the problem of not knowing when the visit is finished. JFB: We're not running AP through BPS. We need some execution environment. K-T: One worry I have is whether we have multiple pipeline starts or a blocking operator that waits for data to show up. We should discuss.
RHL: We can't special case Alert processing. we should be using generic mechanisms. JFB: on the point of AP vs. not-AP. The question isn't are we going to have these other things. Are they ENOUGH like AP, and is it hard enough to write one of these custom executors? I predict it's not hard to write one more.
Task: Discuss. Someone has to write a Pipeline.yaml. If it's writing a new pipetask for this, then we have to talk about it.
RHL: There's a layer of controlling the processing that we're not paying attention to, e.g. maybe we define the visits outside. KT: Jim had "Specifying an external set of dataIds" on his list and a slide on it. [We all behold Jim's slide] TJ: How does that get passed to the next task? JFB: We can define a summary dataset. TJ: I think there's more; let's talk offline.
KSK: 1) Can you defer loading? Yes. Look at coaddition and FGCM. 2) How do you get the provenance so you can see what data has actually been used? Hasn't been written yet. 3) Custom datastore that writes metrics to the database and then gives the registry access to those columns.
RHL: How do we manage those lists? Who owns the job of this? Frossie is going to say that we can use OWL, but what DB does it talk to? I'm willing to define it to not be middleware, but someone has to own it. There are a few tables Jim put in the butler for good and bad exps, but that doesn't cover it.
KT: You have access to a wide variety of databases where you can query for a variety of exposures.
RHL: And I want it to track WHY I did it.
KT: That's a lab notebook.
RHL: That's a Rubin deliverable.
KT: Let's figure out how.
Frossie: OWL will go towards this. But it's not a processing control system.
TJ: There is absolutely a gap.
Jim: Get someone to try doing ourselves for a couple of days
Yusra AlSayyad will look into what is needed for HiPS and write epics with Gregory Dubois-Felsmann, but there's no commitment that this will be in the next cycle
Server should be pretty trivial
Could partner with SPHEREx for some development, as SPHEREx will be generating HEALPix all-sky maps
The AP team is now testing precursor datasets large enough to require real databases. Do we push forward with Postgres at NCSA? Try to integrate Cassandra?
More broadly, can we discuss the path towards the DM-AP-16 ( Full integration of the Alert Production system within the operational environment) milestone, with a view towards commissioning and pre-operations activities?
User account and rerun structure needs to be solved regardless of database system
Fritz
Is the DB access abstracted through the AP API built by Andy S.?
The access is through the API, but later analysis is not
Frossie
How/why do we use Cassandra?
Fritz
There are some technotes
High concurrency
Spatially restricted queries with low latency were not otherwise available
Simon
Cassandra apparently supports RBAC
K-T
The API question is a good one, though; we could do a "friendly user" setup with everyone sharing a single account and use the API to keep people separated.
Fritz
Where DAX is:
Still need to prove out to full year of simulated AP
Have targeted Google cloud for next round of experiments
Andy S will run next round of tests this coming year
Will have to be a productization phase
Do we run it at NCSA or in the cloud?
Eric
It’s not clear that we will actually need Cassandra even in commissioning
Wil
I thought it was after 6 months of data that we needed it
Fritz
Is there a shim to make Postgres work right now?
Eric: That’s what we’re doing
Reading the technote, it’s not easy to set up Cassandra, would need Andy S.
Should try to bridge the gap with Postgres now
Fritz
We need to know what environment we will run in
NCSA or Google cloud, or USDF
K-T
The fastest solution might be to take an environment variable and tack it on to the user name
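K-T's suggestion can be sketched in a few lines of Python. The environment variable name `AP_RUN_SUFFIX`, the separator, and the function name are assumptions made for illustration; nothing here is an existing Rubin or APDB interface.

```python
import os


def effective_user(base_user: str, env_var: str = "AP_RUN_SUFFIX") -> str:
    """Return base_user with the value of env_var tacked on, if it is set.

    AP_RUN_SUFFIX is a hypothetical variable name chosen for this sketch.
    """
    suffix = os.environ.get(env_var, "").strip()
    return f"{base_user}_{suffix}" if suffix else base_user


# Example: everyone shares one account, separated by a per-run suffix.
os.environ["AP_RUN_SUFFIX"] = "rerun42"
print(effective_user("shared"))
```

When the variable is unset or empty, the base user name is returned unchanged, so existing single-user setups would behave as before.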
Colin
To clarify, Eric is describing ad-hoc usage where there are many things going on
For that, Postgres sounds sufficient once we solve managerial problems
Wil
There are probably only a handful of calls AP uses
Fritz
How much of that is inside the API and how much is outside?
Eric
Ian and I need to look at the API
Colin
In the ad-hoc realm, I don’t want to constrain people
Wil worried about the pipeline code, Colin: that’s all in the API
Fritz
The perceived complication of using Cassandra is the complicated configuration needed to tune it
Once it is set up for production, it should be easier to set up a new instance
All the other pieces for integrating AP
Wil
Want this to be ready when the whole camera is on sky
K-T
After OCPS is working (little more than a month)
Then want to take one AP pipeline and plug it in and run it
Gives us minimum functionality
We already have DAX simulators on the test stand, can use that
Robert
Won’t we use OCPS first? Yes
Wil
Full integration could be what K-T is saying, and could be enough for the first year
No requirement on timing of alerts during commissioning, just that we do them
Could add a second milestone for later for full working system
Fritz
How do we manage the gap and color of money issues between NCSA and the USDF or Google?
Wil
Talk to Richard D to get hardware at SLAC
Could potentially complete milestone a couple months into operations
There is no point in testing integration before we are ready
K-T
All of this can be tested at NCSA
Wil
Could run at NCSA or Google
Frossie
What is the Alert Pipeline scale when ComCam goes on the sky?
Wil
ComCam is only 10% of focal plane, 5% data at best
Wil
Yes, we have enough machines at NCSA, because we do not have to generate alerts for every exposure, and not in real time. Will be best effort basis
Fritz
First I heard the latency requirement is relaxed in commissioning
Leanne
We told the community we would package up all the alerts, but with no expectation on latency
Colin
During commissioning we have to prove we can do the real thing
Frossie
Nice to prove we have a working thing, even if it is not at full scale
Eric
We have to demonstrate that we can meet the requirement, but we don’t have to do that with full focal plane and at a sustained rate
Fritz
Running one CCD in one database isn’t very different than running many in many databases
Colin
Is OCPS all I need to run AP?
K-T
It is not the designed component to execute it, but we will see if it can do it
The prompt processing system is the designed component, doesn’t exist yet
Colin
Is somebody building this?
K-T
Nominally yes, but no one right now
Thought it was part of NCSA WBS, but is probably in a grey area
RHL
K-T is experimenting with a more flexible system
Even if it doesn’t work out, it’s still progress towards a functional system
K-T
That’s the problem right now, there’s no backup
RHL
Possibility of using Auxtel to test AP
Early next year, could do end-to-end test
There are filters in place, can use it as a camera
Wil
Technically we have almost a year into Operations before we need to distribute alerts
Colin
There is a big push from the agencies and Zeljko to have a working system on Day 1 of Operations
New framework for metric computation – Leanne and Simon
Called Fast (or Flexible) Analysis of Rubin Observatory performance (FARO)
Frossie
This looks phenomenal, the SQUASH system has been empty
Leanne
Simon and Keith have done a lot of work on this
K-T
Is the validate_drp used in this comparison running all the same things the one in Gen 2 was? Yes
Impressive that it is faster
RHL
What is your plan for scaling out to handle large quantities of data?
Leanne
That is a high priority
We want to first complete our validation on RC2 and then move to analysing PDR2 when available
This is an afterburner; we need to have run science pipelines first
RHL
Where do we discuss whether the metrics are good enough?
Leanne
We have a Slack channel we’ve been using to develop this (#dm-svv)
Should turn that into a wider channel for all DM, for everyone doing QA
Wil
This will integrate nicely with pipelines, it could be added to the end of any pipeline? Yes
Simon
You can even interleave them
Run it after one step of the pipeline, then go on to make coadds (for example) and later run more metrics
Yusra
Science Pipelines are happy with this
RHL
Metrics are great, but we will use this to discover problems we didn’t expect
We need to pay attention to how this becomes useful to Pipelines, without overwhelming the service
Eric
On integration, we have stood up monthly QA meetings with the commissioning team and SVV
Does it need to interface with the APDB as an afterburner, or can you do more ad-hoc analysis?
Simon
We understand there are some metrics people will want to run in-situ
That can live together with FARO tasks
Everything should still use the Butler
RHL
On the boundary, if we find discrepancies in the output, whose job is it to drill down and find the cause?
Yusra
We have found it useful to have a very senior person like Lauren in place to triage, and know who to direct it to first
Wil
In Operations, we’ve tried to put that group all together with Leanne. They can analyze who to send it to. There is no clear answer ahead of time whose job it is
DLP-526 states the archive center is complete at NCSA, can never be completed as written
LDM-503-14 Do we need to split these into 1a and 1b milestones
Leanne
Action: Add intermediate milestones (on Wil and Leanne)
Pipelines
Yusra
I don’t want anything holding up our releases
Leanne
We do the test to check for any major regressions, not whether specific milestones have been met
Infrastructure/Integration
Michelle
We should move the LSSTCam Ops out until after the camera is on the mountain
Wil
It is tied to that, but the camera team hasn’t updated their milestones
DAX Plan to end of construction – Fritz
Don’t have many DAX milestones, need to get more on that are appropriate
DM-DAX-5
Have a workflow set up with Hsin-Fan
DLP-802 Alert Production Database design
Need some time from Andy S, should complete in January
Need milestones that drive development and APDB integration
Features in the TAP service
Wil
Correct thing is that you have a milestone that you will deliver X, which is blocking Frossie
Fritz
Worried about past milestones that were not related to concrete design decisions
Wil
Need milestones for tracking when decisions must be made
We have two sets of milestones, for construction and for operations
It is hard to link milestones from outside the project
Simon
Can milestones be attached to multiple WBSs or do we need duplicate milestones for decisions that affect more than one WBS?
Wil
The level of the milestone reflects the breadth of subsystems it covers.
Simon
If there is something that needs to be decided by Arch for TAP, that would be a level 2 milestone
Fritz Mueller Get together with Colin and Frossie, and define milestones for RSP dependencies on Qserv (Notes: this meeting was held, resulting in tickets DM-29682 and DM-29683 as first steps.)
DAX estimate of effort to complete
Roughly 114 FTE months of effort left to complete, have 30 on construction and 120 in ops
Can some activities be shifted to ops, or do we need to revisit the ramp into ops?
Wil:
To finish, DAX needs construction funding through FY 22
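The numbers quoted above can be checked with a short calculation. This sketch uses only the three figures from the notes (roughly 114 FTE-months remaining, 30 available on construction, 120 in ops). The variable names are illustrative, and it deliberately ignores whether ops funding can actually be used for the remaining work.

```python
effort_remaining = 114        # FTE-months left to complete (from the notes)
construction_available = 30   # FTE-months available on construction
ops_available = 120           # FTE-months available in ops

# Work that construction funding alone cannot cover.
construction_gap = max(0, effort_remaining - construction_available)

# Ops effort left over if the whole gap were shifted into ops.
ops_left_after_shift = ops_available - construction_gap

print(f"Gap beyond construction: {construction_gap} FTE-months")
print(f"Ops effort left if the gap is shifted: {ops_left_after_shift} FTE-months")
```

The gap is what drives the question of shifting activities to ops versus revisiting the ramp.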
Most ARCH effort is LOE, and milestones on the books look good
Among stated goals: eups-independence
Some staffing reduction
Wil: Some additional MW milestones needed?
DRP/AP (Yusra)
slides
The "milestone cliff" apparent on the milestone graph is coming up. A big part of this is pipelines (5 milestones overdue, and 10 more due in next 3 mos.)
Many of these are made much easier by arrival of BG3, so they should go quickly. Others are genuine concerns (see slides).
A big component of pipelines planning is annual review of DPDD, producing "annotated DPDD." This is coming up in January. Revisit what needs to be done with a hard eye toward what is really needed for DR1. Typically less rosy than just looking at milestones.
Some ongoing activities don't currently have associated milestones (shapes for shear estimation, inferred SEDs, CBP pipeline).
Concerns registered re. shrinking commissioning on-sky time for shaking things out.
Not a lot of milestones left, maybe need more, or not at this point?
"Full delivery of DF conops" milestone should definitely go over to USDF.
Twilight and transition plans for '22/'23 needed in greater detail for budget and people planning.
Wil: some USDF preliminary plan documents expected from SLAC Jan/Feb and will help clarify.
Wil: transition-to-USDF milestones for individual services/sub-systems seem needed
Wrap up discussion:
Reminder from Wil: we can use LCR process to move milestones as appropriate if we get to them before they hit the monthly report. Best is to tie them to construction or test milestones; Wil can help you identify these.
Covid impacts:
Perhaps 10-20% impact on development efficiency? (preliminary/speculative)
More on order of 1yr. due to commissioning / summit delays
Late decision re. USDF has also impacted schedules
Fritz Mueller Create shared Google sheet to collect T/CAM budget burn-downs
Wil: There needs to be a small test stand in Tucson before the NTS gets turned off
Frossie:
Wil: when UK/Fra deploy our science platform, do they get our auth? Frossie: Europeans have a CILogon equivalent; she's willing to do the work to integrate with them. Wil: that's ID, but what about auth? Frossie: they will use our group management identity, and we manage data rights for them
Frossie: schedule will be packed next cycle due to DP0 and commissioning, so less ability to respond to interrupts
KT:
Colin: what is the problem with the long-haul networks? K-T: usually transfers using multiple connections work fine, but single connections often die
Leanne:
Fritz:
Ian:
Colin: pytrax integrating to SQuaSH–Square? Ian: pushing metrics, not integrating, sorry
RHL: can you update with performance on various datasets? Ian: HSC bulge data still on deck. DECam bulge data made it through SFP and template building quite successfully on 99.5% of 20k CCD-visits. Diffim tests awaiting Postgres. Single-CCD diffim test looked okay. (longer discussion of other datasets)
Gregory:
frozen, but wanted to give an update on Firefly TAP capabilities
Fritz: is Postgres sufficient to drive your demo? GPDF: yes but there are some detail questions. Fritz: let us know if there are items that should be prioritized for Kenny
We said we'd do Focus Friday provisionally until this Nov meeting. How's it going? Do we want to keep it up?
strong votes of support from Frossie, Tim, Michelle
K-T relays one anecdotal report: "I have a hard time talking to people on the project and now I have 20% less time for it"
Fritz: mixed reaction from his team, might prefer every other week (10% instead of 20% of our time)
Robert frequently gets stuck on his work on Friday because he can't ask questions, and it pushes people to email in non-public forum. Would prefer a no-meetings Friday, or a "no non-urgent questions"
Leanne also is concerned that communication is still happening but in private rather than public channels. Has reports from some developers that they are also hindered. Do agree with no regular meetings
Frossie: wants a form that can run every week, so people will plan ahead–"on a plane to Japan". Points out that burden here is differential–some people/teams send mainly outbound Qs, some people/teams get lots of inbound ones, 2 minutes at a time. Thinks perhaps people should be allowed to ask questions–but shouldn't expect or plan for a response
Simon: disputes the claim that development is less efficient; it forces him to work through problems and batch questions rather than just pinging Jim every 5 minutes. Have to avoid demands for attention on Focus Friday unless something is blocking a whole team
Tim: appreciates that he doesn't have to catch up on lots of Slack messages, can plan on getting 2 story points
Jim: appreciates the time for his own productivity; not representative, as he does get lots of inbound Qs. Does think quick qs
Yusra: worry is about decisions being made in public channels
Ian: also supportive of an expectation of minimal discussion and no guaranteed response, but allow Qs. Does end up with telecons anyway on Friday due to non-DM folks
Simon: liked how Yusra directed conversation off of Slack and onto a relevant ticket
Wil: worried about slippery slope from "no messages" to "some messages" to "same as any other day" but will try to create some language that works
Wil O'Mullane Draft PR for Focus Friday to allow non-urgent Qs on Slack without expectation of response.
14:15
Wrap up
next DMLTs:
2021-02-22/25 - Nominally Tucson - Virtual I think so ...