All technical sessions of DMLT vF2F will be open to DM members. Should there be a need for a closed session, it will be marked "DMLT only".


Logistics

Date 

2022 June 7-9

Location

This meeting will be virtual.

Join Zoom Meeting

Meeting: https://noirlab-edu.zoom.us/j/96067263847?pwd=dW1DSUpHR1NEMFZmcE1YUTRYb1ZPQT09

Meeting ID: 960 6726 3847

Password: 772575


Excused:

Gregory Dubois-Felsmann will not be available 1pm-2pm Tuesday and 12n-1pm PDT Wednesday.

Attendees (please add yourself if missing):



Day 1, Tuesday

Time | (Project) Topic | Coordinator | Pre-meeting notes | Running notes

Moderator: Kian-Tat Lim 

Notetaker: Ian Sullivan 

09:00 Welcome - Project news and updates
  • Review of action items.
    • Request for RHL to sign off on the write-up of campaign management
  • Project news
    • Q: KTL: based on BOT review, when needed should we change the requirements or get a non-conformance waiver?
      • WOM: It depends on the requirement. If the requirement is very clear, it is best to do a non-conformance waiver
      • GPDF: In some cases (software) you really can do it later
      • LG: Some requirements will get punted to Operations
        • FE: We have to watch the requirements and make sure we can deliver a working system
        • WOM: Yes, a good example is the compute hardware for the first year of the survey which we could defer to Operations and save some hassle and money.
    • CrS: If you are coming to the summit, make sure to send Cristian Silva a Slack message to set up space on the bus
    • Need to look over the scope options Wil posted in #camelot recently
09:30 PPDB, APDB, User databases - status update (15m + 15m Q&A). What is the status of the choice of technology for the PPDB and APDB user databases?
  • EB: You stated that Solar System processing will run off the PPDB
    • Will APDB replication to the PPDB be done in time for SolSys to start right away?
    • AS: We can do some replication in small chunks through the night. When observations stop for the night, I hope we can do the replication in less than one hour.
    • KT: It's nice to run the Solar System processing off the PPDB since that is all public; we would have to implement additional filtering if it were run using the APDB
    • EB: Are we limited from putting measurements out within 24 hours? KT: All except unapproved streaks
    • EB: Newly identified objects from the MPC will need to get replicated back to the APDB before the next night's processing
    • FM: The hardware for the APDB is expected to live in the OGA rack at the USDF, so this can't be tested until that rack exists
    • KT: With the hybrid model this could be in the cloud, which would have scaling benefits for users
    • FM: APDB is concretely defined by construction requirements, but PPDB is not as well defined.
      • Would be helpful to attach additional requirements
    • EB: AP is concerned about integration risk, and will be ready to test as soon as the hardware is ready
    • LG: To be clear, this is a decision to use Cassandra for the PPDB? WOM: Yes
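The incremental replication AS describes above (small chunks through the night, finishing within an hour of the end of observing) can be sketched generically. This is a minimal illustration using SQLite and a hypothetical `dia_source` table; the real APDB/PPDB use Cassandra/Postgres and a different schema:

```python
import sqlite3

# Hypothetical stand-in databases; the real APDB is Cassandra and the
# PPDB technology is under discussion, so this only shows the pattern.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for db in (src, dst):
    db.execute("CREATE TABLE dia_source (id INTEGER PRIMARY KEY, flux REAL)")
src.executemany("INSERT INTO dia_source VALUES (?, ?)",
                [(i, float(i)) for i in range(1000)])

BATCH = 100  # replicate in small chunks through the night
last_id = -1
while True:
    # Keyset pagination: resume from the last replicated primary key.
    rows = src.execute(
        "SELECT id, flux FROM dia_source WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, BATCH)).fetchall()
    if not rows:
        break
    dst.executemany("INSERT INTO dia_source VALUES (?, ?)", rows)
    dst.commit()
    last_id = rows[-1][0]

print(dst.execute("SELECT COUNT(*) FROM dia_source").fetchone()[0])  # prints 1000
```

Chunking by key range keeps each transfer small, so the end-of-night catch-up is bounded by whatever arrived since the last chunk rather than the whole night's data.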
10:00 Metric persistence in the Butler

Tests using DC 0.2 to test loading and querying thousands of metrics persisted as lsst.verify.Measurement objects (DM-34556) show ~0.3 to 0.5 sec per lsst.verify.Measurement object. This is not tenable for science verification work. I understand there are plans to use JSON instead of YAML as the storage format to address this? What are the concrete plans and timeline for implementation?
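To put the numbers above in perspective, a stand-alone timing sketch of JSON parsing for a Measurement-like payload (field names are hypothetical, and this uses only the stdlib, so it says nothing about Butler formatter overhead) suggests the format itself leaves plenty of headroom:

```python
import json
import time

# Hypothetical stand-in for a serialized lsst.verify.Measurement;
# the field names are illustrative, not the real schema.
payload = {
    "metric": "validate_drp.AM1",
    "value": 12.3,
    "unit": "marcsec",
    "notes": {"estimator": "median"},
}

n = 10_000
blob = json.dumps(payload)

start = time.perf_counter()
for _ in range(n):
    json.loads(blob)
per_obj = (time.perf_counter() - start) / n

# Stdlib JSON parsing of a small payload is on the order of microseconds,
# orders of magnitude below the ~0.3-0.5 s/object reported above.
print(f"JSON parse: {per_obj * 1e6:.2f} us per object")
```

If the observed cost is dominated by YAML parsing, switching formats should help substantially; any remaining per-dataset Butler overhead would need to be profiled separately.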

slides

  • TJ: Many metrics needed for visualization from very different people and teams. Starting to put together a technote with Jim Bosch describing what middleware thinks about metrics: DMTN-203
  • Non-middleware team solutions
    • FE: Where will the parquet files live, and how do you find them?
      • JB: They are butler datasets
    • Aggregating metrics
      • KT: Concern that aggregated metrics will obfuscate errors, delay error reporting 
      • NL: You can do both: a few metrics that are key to assess quality, and the rest aggregated at the visit or tract level
      • YA: We can use the ccd-visit table for metrics in DRP, what about in AP?
        • EB: Not clear you'd want a code path that relies on the APDB for AP metrics
  • Middleware team solutions
    • Sasquatch-backed datastore
      • TJ: In Influx2 there is no problem with one dimension associated with multiple timestamps
    • Consolidated Database
    • Registry storage and queries
      • See DMTN-220
      • FE: Would still like some of the Consolidated DB functionality from the registry
        • TJ: When we are running 10,000 jobs simultaneously, we can't be querying back to the SQL tables
      • IS: Would this allow building a quantum graph based on metrics in the registry?
        • TJ: It would be hard to pull in data from the EFD (say, wind speed), possible to query on e.g. seeing
        • JB: It is important to distinguish between ad-hoc processing runs, and runs during production
          • In production, it is important to build that logic into the Task, so that it is not required that the user operator sets things up right
      • KB: During commissioning we will often be trying to identify different sets of data, which may need to be flexible to select based on arbitrary metrics
        • TJ: The registry can't know all of the metrics in the EFD
      • FE: When and how do the quantities that get calculated in the graph get written
        • TJ: Everything is a Butler.put
        • KT: An afterburner can be a PipelineTask that is run right away
        • TJ: You don't want to delay the issuing of alerts because you are writing metrics to the summit
        • TJ: OCPS will be generating metrics for the summit
      • KB: We will talk more about the connection to FAFF later
10:30 Break

Moderator: Leanne Guy 

Notetaker:  Wil O'Mullane 

11:00

Data Quality/Integrity at all levels and timescales. Potential workshop later this year.


SLIDES

How do we go about looking at algorithm/hardware quality issues? Can we come up with better terms and stop using QA as a catch-all?

I would also like to understand how to get Faro into mainstream development.

QA of what?
LG: Quality Assurance is a process; the same process could be applied in many places.
Verification of requirements and validation of purpose.

KB: ready to begin the 10-year survey. Documenting that we can meet the goals.

CS: For Operations it's the products.

Yusra: operations is QA of the running system - the previous three are construction (Pipelines, Sci Validation, Comm System V&V).

ZI: is there a doc and a hierarchical representation?
YA: It is new; we discussed the hierarchy - the code in analysis_drp will be the doc, so the truth can be extracted.
ZI: How do I know if you have a metric X, e.g. scatter of astrometry around Gaia - how do I know if you produce it?
YA: For metrics, ask the butler; all are in the lsst.verify package. Want to have a place where they are defined in a reproducible manner. RTN-083 should provide the algorithm implementation for SRD metrics.
ZI: similar problems with survey metrics - could apply some of the same ideas there.

JP: I'm still wondering how I learn about what is in the plots / how to format my own plot.
YA: we learned in March - there will be docs and a tutorial for the bootcamp at the end of June. In the next weeks, look at analysis_drp and copy and paste one.
JP: But how do you know what's in the plot - YA mentioned color and position? How do I know what blue means?
NL: It is labelled in the plot.
LMA: talked through a plot a bit more, to say which SN was used etc. - it's all on the plot.

On Faro / analysis tools
KT: now no distinction between metrics and plots; more QA for X vs QA for Y.
NL: there are pieces which can be put together in different ways to make metrics and plots; there will be pipeline tasks for them.
KT: Faro will still contain metrics and plots; analysis_drp will still exist with different kinds of plots.
YA: As of now, analysis_drp and Faro will mainly contain the pipelines.
LG: Faro will continue to have implementations of metrics; all catalog matches etc. done in analysis tools. All the normative ones - and perhaps others.
YA: not always - you want them in analysis tools.
LG: what about normative metrics?
JC: Analysis tools would be good to have as a canonical place to find metrics.

ZI: happy with the direction of progress - put it under one umbrella. Beware of open-ended projects. Don't underestimate how important and general this is - not just commissioning but also science collaborations. Make training persistent if possible.
NL: that is at the front of my mind!

Missing functionality:
WOM: Would be nice to have terms like instrument health instead of all QA.
KB: Just used QA in the title since it was the topic.

FE: Gregory and Angelo to FAFF.
We had working groups; they have a charge and an end - we do not want standing committees.
KB: Note that FAFF is not intended to implement these capabilities, but rather to define what is needed.
FAFF Round 2 resulted from the need to better define what is needed in terms of data processing and data quality assessment on the minutes-to-24-hour timescale.

WOM: ZI wants a workshop - would be in October, after Calibpalooza.

11:45 User batch - some (many) open questions. The first draft https://dmtn-223.lsst.io/v/DM-34198/index.html contains questions in bold; we can look at the PDF. Think and answer offline.
  •  DMLT provide feedback on DMTN-223
  • Wil O'Mullane to set up separate batch meeting to bring in other concerned parties.
12:00 Prompt Processing. How to get PP running on AuxTel data. Is transferring data from Cerro Pachón or USDF to Google Cloud still the best way to make progress? Would it be better for someone to get it running natively at USDF instead? There's a proposal of an AP + AuxTel sprint in the last two weeks of July.


EB: APDB is also an output - in addition to the butler exporting files and catalogs, there are a bunch of other things.
KTL: some things could be released sooner if not pixel or measurement data.
Feedback to the summit in the ICDs is all scalar metrics.

IS: short-term alert distribution will not be available (in summer).

FE: Kafka Alert Tooling - with Spencer gone, is there a new dev?
EB: New hire in August. Spencer got as far as writing to the AlertDB (object store, GCS bucket).

EB: We can use Postgres as the APDB at USDF now - should have templates from LATISS (there are suitable images).

SP: Is Tony going to push to USDF at the same time as OODS? What about network outages?
KTL: yes, or even before - drop it if the network is out. Have no code to go back to the DAQ.

RHL: Not clear how much effort going via the cloud is; thought we were using OODS for prompt.
KT: Cloud vs USDF - zero extra for cloud; we have to do it anyway, and USDF is a superset of tasks.
For OODS - there is a use for the delayed-image butler; puts go there.
As to whether this is the commissioning cluster - no reason why not.

RHL: looks different.
KTL: it's a different way to run pipetask; could replace some of this with PanDA (discussion for later). This is using workers essentially as their own pods - some advantage to a batch system, but it may not work in this context.

EB: end of July, some time from AP people - cloud only, not USDF, unless there's a huge push.
KTL: will support that.
RHL: also Tiago to support from the mountain.

12:30 Break

Moderator: Wil O'Mullane 

Notetaker: Kian-Tat Lim 

1:00 (DMLT only)

High-visibility roles for senior team members




1:30



14:00 Close

Day 2, Wednesday

Moderator: Ian Sullivan 

 Notetaker: Colin Slater 

9:00 USDF status and Hybrid model - Richard Dubois. Hybrid: what is it? USDF transition plan - brief presentation, time for questions. RTN-021 and DMTN-189
  • Frossie: may still want to host a partial qserv on the cloud.
    • Fritz: would not recommend qserv below a certain size scale, if it's just object-lite then may be better off with a conventional DB
  • Frossie: would prefer butler client-server to have the server in the cloud, and it can do the backhaul to SLAC
  • KT: client-server butler is a prerequisite for anything. Have as much information as possible in the cloud, with only the large data coming from SLAC; would enable better caching in the butler. User storage only in the cloud, not at SLAC - better for quota management etc. But then batch processing is complex; release data is separate from user data in that model.
    • Richard: HEP often has users submitting to "anywhere", but how the user data gets to that compute is tricky.
  • Gregory: can't assume that all user file workspace data is stored in the butler.
  • Ian: what limits are there on transfer e.g. of the coadds?
    • KT: could put throttles in the various VO services. But hasn't been done. DESC is clearly asking about direct transfers.
  • RHL: what is the definition of when we need to be ready? Are you thinking AuxTel or only ComCam?
    • August 15, all data has to go to SLAC. Would hope that the AuxTel mechanism is the same as the ComCam mechanism.
  • Jim: Is the plan for transferring NCSA Postgres just dump/load? Hasn't been any discussion yet.
  • CTS: what fraction of /repo/main is going to SLAC? All of it.
9:30 Status of Campaign Management planning (pdf, pptx) - Eric Charles: plan and how it might be implemented.
  • Wil: interesting that you "check off" some of the datasets manually, seems labor-intensive.
    • In Fermi they mostly just look at warnings and follow up on those. Don't inspect every pipeline run.
  • (missed question from Robert)
  • Leanne: Gaia would send alerts.
    • Fermi page shows a bunch of results; could write something to look at that and send alerts.
    • KT: expect metrics to go to something like sasquatch, and then alerts go to OpsGenie.
  • Jim: distinguish between defining the inputs to a campaign (what Robert is talking about), vs. splitting the inputs to separate workflows (what Eric is talking about).
  • Jim: can't quite use just tagged collections for grouping, since they only allow you to group the inputs to processing. So we need a way to group the output dataIds, have planned for a while but not quite implemented.
  • Richard: do you expect campaign management to be "auto-filling" the queue with work? Yes.
  • KT: why port to the cloud? Not a big difference but some amount of work to make things portable.
  • Yusra: 100sq deg HSC run was delayed so that we can do it on USDF. Once these tools are ready, it would be good to try them out.
  • Follow up
10:00 Antu - commissioning - Cristián Silva
  • Cristián Silva, Robert Lupton: What do we need for commissioning, and when to order it?
    • what needs to be installed on a cluster at the base or summit
    • do we need processing at the base, or should we rely on USDF?
    • how does the camera diagnostic cluster fit in?

Outside DM there are people we should probably bring into this discussion (Tony, Patrick?)

  • Frossie: if it needs to be at the summit, better to just add the nodes to (an existing summit system?) rather than keeping it separate. If it's acceptable for it to be off-summit, then might as well be USDF rather than Base. There's a cost to having "yet another cluster," want to make sure it's worth it.
  • KT: separation of workloads is one of the reasons for separating them. And had originally hoped that the comm cluster wouldn't need to be DDS-enabled. Reasons to have it in Chile: try for lower latency, keep function if LHN goes down (but LHN may have better reliability than the base-summit link), sharing long term storage with Chilean DAC.
  • RHL:
    • Proposes to move commissioning cluster to the summit.
    • OSS is out of date on what is at the base.
    • Summit-base link has been unreliable. Microwave is coming, but not big enough for data (only control).
    • 95% of the time expect to be able to use the USDF. But need to figure out the minimum compute we need for when the networks are down. FAFF2 may work on that.
  • Cristián: When would you be using this? RHL: Ideally, would like to be running realtime AuxTel analysis in six weeks.
    • Not going to happen in six weeks. (understood).
  • Wil: what does the commissioning cluster run?
    • Something that looks at the last 24 hours and says is the data ok. Not the same as prompt processing.
    • If we're doing this, then it's more of a summit compute facility and not a commissioning thing, we should call it something different.
    • But also there's the diagnostic cluster? Merge this (Antu) with either diag cluster or Yagan (telescope cluster at the summit).
  • Frossie: role is muddied. From Frossie's POV, there are two types of clusters: telescope clusters (DDS-enabled, controlled) and science clusters (fast rollout). Don't mind having two clusters at the summit, but should decide if this is DDS vs science.
    • In early days, we said we could live without the network for ~days, but now we've adopted things like the science platform that we can't live without for that long. Need to do an inventory for what software are must-haves during a night without the network. If QA is necessary but we can live without it for ~3 nights a year, then USDF.
  • Cristián: all for having fewer clusters.
  • KT: inventory of: things that need to be at the summit because they rely on pixel data, or because they're needed for observing the next day. Second inventory: what things can survive being cut off from the summit; hope that the microwave link means nothing will be totally broken if the fiber is cut, but if both are down then many things may be hosed.
  • Camera diagnostic cluster: runs camera visualization pan/zoom. Camera team will expect to run camera team tools on it, their particular database/display utils. Simplest thing is to keep it small.
    • Wil: agree, let's leave the diag cluster out of this.
  • let's write a document.
  • KT: still lots of questions about who owns the services on top of this cluster: butler, rsp, etc.
10:40 Break

Moderator: Kian-Tat Lim 

Notetaker: Wil O'Mullane 

11:10 Milestone parade - Wil O'Mullane
  1. Sparse milestones until Sept 2023.
  2. 1a done epics/milestone - Leanne has a test plan (DM-32340; not tagged nor linked to a milestone in LDM-503). Should we modify DM-17133 to say all of 1a is verified on this?
  3. https://docs.google.com/spreadsheets/d/1TUIUf84qHX5QfcCNWs27HGKlgHmCKcCs1IpBP5ODNDA/edit#gid=2061634320

DM-32340 - add LDM-503 milestone.

Still need the DAX milestones - needs Frossie and Gregory (Provenance, VOSpace, ...).

Qserv milestone was added by Kevin.


11:39

Plans and Status I

DM Science

DAX

DRP

NCSA

SQuARE

>>> import random

>>> presenters = ['Leanne', 'Cristian', 'Frossie', 'KT', 'Fritz', 'Stephen', 'Ian', 'Yusra']

>>> random.shuffle(presenters)

>>> presenters
['Leanne', 'Fritz', 'Yusra', 'Stephen', 'Frossie', 'KT', 'Cristian', 'Ian']


Leanne: Calibpalooza August 25/26



Yusra did  say Gen3 once (just to note she did not).


KT: is TM Gateway a new thing or one of the previous incarnations? It writes to Sasquatch at USDF, then says which topics need to be replicated back to the summit (SQR-068). KT happy with this.

Demo of the cached runnable parameterized notebooks (Times Square)

12:30

Moderator: Wil O'Mullane 

Notetaker: 

1:00


Arch (.key)

IT-Devops

AP


WOM: MR mentioned the UWS-on-TTS problem.
1:45 Wrap up. Next vF2F Oct 18-19 (possibly 20)

On cutting items at time (or a little over):

(IS) should keep a buffer at the end of each session to allow run-over

(FE) OK as long as we schedule a follow-up


Day 3, Thursday - Open for spill over topics

Time | (Project) Topic | Coordinator | Pre-meeting notes | Running notes

Moderator:

Notetaker:

09:00













12:00Close




Originally Proposed Topics

Analysis_DRP and Faro
I would like to understand the state of convergence here and how to get Faro into mainstream development.
Antu - commissioning (30 min - 1 hr)
  • Cristián Silva, Robert Lupton: What do we need for commissioning, and when to order it?
    • what needs to be installed on a cluster at the base or summit
    • do we need processing at the base, or should we rely on USDF?
    • how does the camera diagnostic cluster fit in?

Outside DM there are people we should probably bring into this discussion (Tony, Patrick?)

High-visibility roles for senior team members30 min open + 30 closed, ideallyWe have a number of people (at least in Science Pipelines) who have been on the team for a long time and contribute a lot, but don't regularly do the kind of work that leads to visibility outside DM or even outside their own DM team, and in some cases they are feeling a bit left behind.  I would like to discuss ideas for recognizing these team members and improving their external visibility, and especially get feedback on some ideas from those on the DMLT who have a lot more experience managing people than I do.  My main idea is to define some new roles with real responsibility over technical aspects of DM that are both of interest to the person taking the role and genuinely in need of more attention or coordination.  I would like to have both an open session to discuss this in the abstract and a closed session to talk about an initial batch concrete roles for specific people.
Metric persistence in the Butler (30 mins - 1 hr)

Tests using DC 0.2 to test loading and querying thousands of metrics persisted as lsst.verify.Measurement objects (DM-34556, "Use DC 0.2 to test loading large numbers of metrics", In Progress) show ~0.3 to 0.5 sec per lsst.verify.Measurement object. This is not tenable for science verification work. I understand there are plans to use JSON instead of YAML as the storage format to address this? What are the concrete plans and timeline for implementation?

Need to shift DMLT (5 min)

https://doodle.com/poll/cd37rsi7ghbnx2ce?utm_source=poll&utm_medium=link

LSST Europe is 24-28 October (currently our DMLT vF2F) : https://sites.google.com/inaf.it/lssteurope4

Suggest we shift to Oct 18

PPDB, APDB, User databases - status update (30 mins - 1 hr). What is the status of the choice of technology for the PPDB and APDB user databases?
Status of Campaign Management planning (30-60 mins)

Eric Charles  has been talking to the key stakeholders and can report on his plan and how it might be implemented.

User Batch (30 min)

DMTN-223  has open issues/questions


Attached Documents


Action Item Summary

Description | Due date | Assignee | Task appears on
  • Frossie Economou will recommend additional Level 3 milestones for implementation beyond just the DAX-9 Butler provenance milestone.
15 Mar 2022 | Frossie Economou | DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17
  • Kian-Tat Lim Convene a meeting with Colin, Tim, Robert, Yusra to resolve graph generation with per-dataset quantities (likely based on Consolidated DB work).  
18 Mar 2022 | Kian-Tat Lim | DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17
  • Frossie Economou Write an initial draft in the Dev Guide for what "best effort" support means  
17 Nov 2023 | Frossie Economou | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24
  • Convene a group to redo the T-12 month DRP diagram and define scope expectations Yusra AlSayyad 
30 Nov 2023 | Yusra AlSayyad | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24
11 Dec 2023 | Gregory Dubois-Felsmann | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24