All technical sessions of DMLT vF2F will be open to DM members. Should there be a need for a closed session, it will be marked "DMLT only".


Logistics

Date 

2022 June 7-9

Location

This meeting will be virtual.

Join Zoom Meeting

Meeting: https://noirlab-edu.zoom.us/j/96067263847?pwd=dW1DSUpHR1NEMFZmcE1YUTRYb1ZPQT09

Meeting ID: 960 6726 3847

Password: 772575


Excused:

Gregory Dubois-Felsmann will not be available 1pm-2pm Tuesday and 12n-1pm PDT Wednesday.

Attendees (please add yourself if missing):



Day 1, Tuesday

Time | (Project) Topic | Coordinator | Pre-meeting notes | Running notes

Moderator: Kian-Tat Lim 

Notetaker: Ian Sullivan 

09:00 Welcome - Project news and updates
  • Review of action items.
    • Request for RHL to sign off on the write-up of campaign management
  • Project news
    • Q: KTL: based on BOT review, when needed should we change the requirements or get a non-conformance waiver?
      • WOM: It depends on the requirement. If the requirement is very clear, it is best to do a non-conformance waiver
      • GPDF: In some cases (software) you really can do it later
      • LG: Some requirements will get punted to Operations
        • FE: We have to watch the requirements and make sure we can deliver a working system
        • WOM: Yes, a good example is the compute hardware for the first year of the survey which we could defer to Operations and save some hassle and money.
    • CrS: If you are coming to the summit, make sure to send Cristian Silva a Slack message to set up space on the bus
    • Need to look over the scope options Wil posted in #camelot recently
09:30 PPDB, APDB, User databases - status update (15m + 15m Q&A). What is the status of the choice of technology for the PPDB and APDB user databases?
  • EB: You stated that Solar System processing will run off the PPDB
    • Will APDB replication to the PPDB be done in time for SolSys to start right away?
    • AS: We can do some replication in small chunks through the night. When observations stop for the night, I hope we can do the replication in less than one hour.
    • KT: It's nice to run the Solar System processing off the PPDB since that is all public; we would have to implement additional filtering if it were run using the APDB
    • EB: Are we limited from putting measurements out within 24 hours? KT: All except unapproved streaks
    • EB: Newly identified objects from the MPC will need to get replicated back to the APDB before the next night's processing
    • FM: The hardware for the APDB is expected to live in the OGA rack at the USDF, so this can't be tested until that rack exists
    • KT: With the hybrid model this could be in the cloud, which would have scaling benefits for users
    • FM: APDB is concretely defined by construction requirements, but PPDB is not as well defined.
      • Would be helpful to attach additional requirements
    • EB: AP is concerned about integration risk, and will be ready to test as soon as the hardware is ready
    • LG: To be clear, this is a decision to use Cassandra for the PPDB? WOM: Yes
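The incremental replication AS describes above (small chunks through the night, finishing within an hour of the end of observing) can be sketched generically. This is a minimal illustration using SQLite and a hypothetical `dia_source` table; the real APDB/PPDB use Cassandra/Postgres and a different schema:

```python
import sqlite3

# Hypothetical stand-in databases; the real APDB is Cassandra and the
# PPDB technology is under discussion, so this only shows the pattern.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for db in (src, dst):
    db.execute("CREATE TABLE dia_source (id INTEGER PRIMARY KEY, flux REAL)")
src.executemany("INSERT INTO dia_source VALUES (?, ?)",
                [(i, float(i)) for i in range(1000)])

BATCH = 100  # replicate in small chunks through the night
last_id = -1
while True:
    # Keyset pagination: resume from the last replicated primary key.
    rows = src.execute(
        "SELECT id, flux FROM dia_source WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, BATCH)).fetchall()
    if not rows:
        break
    dst.executemany("INSERT INTO dia_source VALUES (?, ?)", rows)
    dst.commit()
    last_id = rows[-1][0]

print(dst.execute("SELECT COUNT(*) FROM dia_source").fetchone()[0])  # prints 1000
```

Chunking by key range keeps each transfer small, so the end-of-night catch-up is bounded by whatever arrived since the last chunk rather than the whole night's data.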
10:00 Metric persistence in the Butler

Tests using DC 0.2 to test loading and querying thousands of metrics persisted as lsst.verify.Measurement objects (DM-34556) show ~0.3 to 0.5 sec per lsst.verify.Measurement object. This is not tenable for science verification work. I understand there are plans to use JSON instead of YAML as the storage format to address this? What are the concrete plans and timeline for implementation?
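To put the numbers above in perspective, a stand-alone timing sketch of JSON parsing for a Measurement-like payload (field names are hypothetical, and this uses only the stdlib, so it says nothing about Butler formatter overhead) suggests the format itself leaves plenty of headroom:

```python
import json
import time

# Hypothetical stand-in for a serialized lsst.verify.Measurement;
# the field names are illustrative, not the real schema.
payload = {
    "metric": "validate_drp.AM1",
    "value": 12.3,
    "unit": "marcsec",
    "notes": {"estimator": "median"},
}

n = 10_000
blob = json.dumps(payload)

start = time.perf_counter()
for _ in range(n):
    json.loads(blob)
per_obj = (time.perf_counter() - start) / n

# Stdlib JSON parsing of a small payload is on the order of microseconds,
# orders of magnitude below the ~0.3-0.5 s/object reported above.
print(f"JSON parse: {per_obj * 1e6:.2f} us per object")
```

If the observed cost is dominated by YAML parsing, switching formats should help substantially; any remaining per-dataset Butler overhead would need to be profiled separately.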

slides

  • TJ: Many metrics needed for visualization from very different people and teams. Starting to put together a technote with Jim Bosch describing what middleware thinks about metrics: DMTN-203
  • Non-middleware team solutions
    • FE: Where will the parquet files live, and how do you find them?
      • JB: They are butler datasets
    • Aggregating metrics
      • KT: Concern that aggregated metrics will obfuscate errors, delay error reporting 
      • NL: You can do both: a few metrics that are key to assess quality, and the rest aggregated at the visit or tract level
      • YA: We can use the ccd-visit table for metrics in DRP, what about in AP?
        • EB: Not clear you'd want a code path that relies on the APDB for AP metrics
  • Middleware team solutions
    • Sasquatch-backed datastore
      • TJ: In Influx2 there is no problem with one dimension associated with multiple timestamps
    • Consolidated Database
    • Registry storage and queries
      • See DMTN-220
      • FE: Would still like some of the Consolidated DB functionality from the registry
        • TJ: When we are running 10,000 jobs simultaneously, we can't be querying back to the SQL tables
      • IS: Would this allow building a quantum graph based on metrics in the registry?
        • TJ: It would be hard to pull in data from the EFD (say, wind speed), possible to query on e.g. seeing
        • JB: It is important to distinguish between ad-hoc processing runs, and runs during production
          • In production, it is important to build that logic into the Task, so that it is not required that the user operator sets things up right
      • KB: During commissioning we will often be trying to identify different sets of data, which may need to be flexible to select based on arbitrary metrics
        • TJ: The registry can't know all of the metrics in the EFD
      • FE: When and how do the quantities that get calculated in the graph get written
        • TJ: Everything is a Butler.put
        • KT: An afterburner can be a PipelineTask that is run right away
        • TJ: You don't want to delay the issuing of alerts because you are writing metrics to the summit
        • TJ: OCPS will be generating metrics for the summit
      • KB: We will talk more about the connection to FAFF later
10:30 Break

Moderator: Leanne Guy 

Notetaker:  Wil O'Mullane 

11:00

Data Quality/Integrity at all levels and timescales. Potential workshop later this year.


SLIDES

How do we go about looking at algorithm/hardware quality issues? Can we come up with better terms and stop using QA as a catch-all?

I would also like to understand how to get Faro into mainstream development.

QA of what?
LG: Quality Assurance is a process; the same process could be applied in many places.
Verification of requirements and validation of purpose.

KB: ready to begin the 10-year survey. Documenting that we can meet the goals.

CS: For Operations it's the products.

Yusra: operations is QA of the running system - the previous three are construction (Pipelines, Sci Validation, Comm System V&V).

ZI: is there a doc and a hierarchical representation?
YA: It is new; we discussed the hierarchy - the code in analysis_drp will be the doc, so the truth can be extracted.
ZI: How do I know if you have a metric X, e.g. scatter of astrometry around Gaia - how do I know if you produce it?
YA: For metrics, ask the butler; all are in the lsst.verify package. Want to have a place where they are defined in a reproducible manner. RTN-083 should provide the algorithm implementation for SRD metrics.
ZI: similar problems with survey metrics - could apply some of the same ideas there.

JP: I'm still wondering how I learn about what is in the plots / how to format my own plot.
YA: we learned in March - there will be docs and a tutorial for the bootcamp at the end of June. In the next weeks, look at analysis_drp and copy and paste one.
JP: But how do you know what's in the plot - YA mentioned color and position? How do I know what blue means?
NL: It is labelled in the plot.
LMA: talked through a plot a bit more, to say which SN was used etc. - it's all on the plot.

On Faro / analysis tools
KT: now no distinction between metrics and plots; more QA for X vs QA for Y.
NL: there are pieces which can be put together in different ways to make metrics and plots; there will be pipeline tasks for them.
KT: Faro will still contain metrics and plots; analysis_drp will still exist with different kinds of plots.
YA: As of now, analysis_drp and Faro will mainly contain the pipelines.
LG: Faro will continue to have implementations of metrics; all catalog matches etc. done in analysis tools. All the normative ones - and perhaps others.
YA: not always - you want them in analysis tools.
LG: what about normative metrics?
JC: Analysis tools would be good to have as a canonical place to find metrics.

ZI: happy with the direction of progress - put it under one umbrella. Beware of open-ended projects. Don't underestimate how important and general this is - not just commissioning but also science collaborations. Make training persistent if possible.
NL: that is at the front of my mind!

Missing functionality:
WOM: Would be nice to have terms like instrument health instead of all QA.
KB: Just used QA in the title since it was the topic.

FE: Gregory and Angelo to FAFF.
We had working groups; they have a charge and an end - we do not want standing committees.
KB: Note that FAFF is not intended to implement these capabilities, but rather to define what is needed.
FAFF Round 2 resulted from the need to better define what is needed in terms of data processing and data quality assessment on the minutes-to-24-hour timescale.

WOM: ZI wants a workshop - would be in October, after Calibpalooza.

11:45 User batch - some (many) open questions. The first draft https://dmtn-223.lsst.io/v/DM-34198/index.html contains questions in bold; we can look at the PDF. Think and answer offline.
  •  DMLT provide feedback on DMTN-223
  • Wil O'Mullane to set up separate batch meeting to bring in other concerned parties.
12:00 Prompt Processing. How to get PP running on AuxTel data. Is transferring data from Cerro Pachón or USDF to Google Cloud still the best way to make progress? Would it be better for someone to get it running natively at USDF instead? There's a proposal of an AP + AuxTel sprint in the last two weeks of July.


EB: APDB is also an output - in addition to the butler exporting files and catalogs, there are a bunch of other things.
KTL: some things could be released sooner if not pixel or measurement data.
Feedback to the summit in the ICDs is all scalar metrics.

IS: short-term alert distribution will not be available (in summer).

FE: Kafka Alert Tooling - with Spencer gone, is there a new dev?
EB: New hire in August. Spencer got as far as writing to the AlertDB (object store, GCS bucket).

EB: We can use Postgres as the APDB at USDF now - should have templates from LATISS (there are suitable images).

SP: Is Tony going to push to USDF at the same time as OODS? What about network outages?
KTL: yes, or even before - drop it if the network is out. Have no code to go back to the DAQ.

RHL: Not clear how much effort going via the cloud is; thought we were using OODS for prompt.
KT: Cloud vs USDF - zero extra for cloud; we have to do it anyway, and USDF is a superset of tasks.
For OODS - there is a use for the delayed-image butler; puts go there.
As to whether this is the commissioning cluster - no reason why not.

RHL: looks different.
KTL: it's a different way to run pipetask; could replace some of this with PanDA (discussion for later). This is using workers essentially as their own pods - some advantage to a batch system, but it may not work in this context.

EB: end of July, some time from AP people - cloud only, not USDF, unless there's a huge push.
KTL: will support that.
RHL: also Tiago to support from the mountain.

12:30 Break

Moderator: Wil O'Mullane 

Notetaker: Kian-Tat Lim 

1:00 (DMLT only)

High-visibility roles for senior team members




1:30



14:00 Close

Day 2, Wednesday

Moderator: Ian Sullivan 

 Notetaker: Colin Slater 

9:00 USDF status and Hybrid model - Richard Dubois. Hybrid: what is it? USDF transition plan - brief presentation, time for questions. RTN-021 and DMTN-189
  • Frossie: may still want to host a partial qserv on the cloud.
    • Fritz: would not recommend qserv below a certain size scale, if it's just object-lite then may be better off with a conventional DB
  • Frossie: would prefer butler client-server to have the server in the cloud, and it can do the backhaul to SLAC
  • KT: client-server butler is a prerequisite for anything. Have as much information as possible in the cloud, with only the large data coming from SLAC; would enable better caching in the butler. User storage only in the cloud, not at SLAC - better for quota management etc. But then batch processing is complex; release data is separate from user data in that model.
    • Richard: HEP often has users submitting to "anywhere", but how the user data gets to that compute is tricky.
  • Gregory: can't assume that all user file workspace data is stored in the butler.
  • Ian: what limits are there on transfer e.g. of the coadds?
    • KT: could put throttles in the various VO services. But hasn't been done. DESC is clearly asking about direct transfers.
  • RHL: what is the definition of when we need to be ready? Are you thinking AuxTel or only ComCam?
    • August 15, all data has to go to SLAC. Would hope that the AuxTel mechanism is the same as the ComCam mechanism.
  • Jim: Is the plan for transferring NCSA Postgres just dump/load? Hasn't been any discussion yet.
  • CTS: what fraction of /repo/main is going to SLAC? All of it.
9:30 Status of Campaign Management planning (pdf, pptx) - Eric Charles: plan and how it might be implemented.
  • Wil: interesting that you "check off" some of the datasets manually, seems labor-intensive.
    • In Fermi they mostly just look at warnings and follow up on those. Don't inspect every pipeline run.
  • (missed question from Robert)
  • Leanne: Gaia would send alerts.
    • Fermi page shows a bunch of results; could write something to look at that and send alerts.
    • KT: expect metrics to go to something like sasquatch, and then alerts go to OpsGenie.
  • Jim: distinguish between defining the inputs to a campaign (what Robert is talking about), vs. splitting the inputs to separate workflows (what Eric is talking about).
  • Jim: can't quite use just tagged collections for grouping, since they only allow you to group the inputs to processing. So we need a way to group the output dataIds, have planned for a while but not quite implemented.
  • Richard: do you expect campaign management to be "auto-filling" the queue with work? Yes.
  • KT: why port to the cloud? Not a big difference but some amount of work to make things portable.
  • Yusra: 100sq deg HSC run was delayed so that we can do it on USDF. Once these tools are ready, it would be good to try them out.
  • Follow up
10:00 Antu - commissioning - Cristián Silva
  • Cristián Silva, Robert Lupton: What do we need for commissioning, and when to order it?
    • what needs to be installed on a cluster at the base or summit
    • do we need processing at the base, or should we rely on USDF?
    • how does the camera diagnostic cluster fit in?

Outside DM there are people we should probably bring into this discussion (Tony, Patrick?)

  • Frossie: if it needs to be at the summit, better to just add the nodes to (an existing summit system?) rather than keeping it separate. If it's acceptable for it to be off-summit, then might as well be USDF rather than Base. There's a cost to having "yet another cluster," want to make sure it's worth it.
  • KT: separation of workloads is one of the reasons for separating them. And had originally hoped that the comm cluster wouldn't need to be DDS-enabled. Reasons to have it in Chile: try for lower latency, keep function if LHN goes down (but LHN may have better reliability than the base-summit link), sharing long term storage with Chilean DAC.
  • RHL:
    • Proposes to move commissioning cluster to the summit.
    • OSS is out of date on what is at the base.
    • Summit-base link has been unreliable. Microwave is coming, but not big enough for data (only control).
    • 95% of the time expect to be able to use the USDF. But need to figure out the minimum compute we need for when the networks are down. FAFF2 may work on that.
  • Cristián: When would you be using this? RHL: Ideally, would like to be running realtime AuxTel analysis in six weeks.
    • Not going to happen in six weeks. (understood).
  • Wil: what does the commissioning cluster run?
    • Something that looks at the last 24 hours and says is the data ok. Not the same as prompt processing.
    • If we're doing this, then it's more of a summit compute facility and not a commissioning thing, we should call it something different.
    • But also there's the diagnostic cluster? Merge this (Antu) with either diag cluster or Yagan (telescope cluster at the summit).
  • Frossie: role is muddied. From Frossie's POV, there are two types of clusters: telescope clusters (DDS-enabled, controlled) and science clusters (fast rollout). Don't mind having two clusters at the summit, but should decide if this is DDS vs science.
    • In early days, we said we could live without the network for ~days, but now we've adopted things like the science platform that we can't live without for that long. Need to do an inventory for what software are must-haves during a night without the network. If QA is necessary but we can live without it for ~3 nights a year, then USDF.
  • Cristián: all for having fewer clusters.
  • KT: inventory of: things that need to be at the summit because they rely on pixel data, or because they're needed for observing the next day. Second inventory: what things can survive being cut off from the summit; hope that the microwave link means nothing will be totally broken if the fiber is cut, but if both are down then many things may be hosed.
  • Camera diagnostic cluster: runs camera visualization pan/zoom. Camera team will expect to run camera team tools on it, their particular database/display utils. Simplest thing is to keep it small.
    • Wil: agree, let's leave the diag cluster out of this.
  • let's write a document.
  • KT: still lots of questions about who owns the services on top of this cluster: butler, rsp, etc.
10:40 Break

Moderator: Kian-Tat Lim 

Notetaker: Wil O'Mullane 

11:10 Milestone parade - Wil O'Mullane
  1. Sparse milestones until Sept 2023.
  2. 1a done epics/milestone - Leanne has a test plan (DM-32340; not tagged nor linked to a milestone in LDM-503). Should we modify DM-17133 to say all of 1a is verified on this?
  3. https://docs.google.com/spreadsheets/d/1TUIUf84qHX5QfcCNWs27HGKlgHmCKcCs1IpBP5ODNDA/edit#gid=2061634320

DM-32340 - add LDM-503 milestone.

Still need the DAX milestones - needs Frossie and Gregory (Provenance, VOSpace, ...).

Qserv milestone was added by Kevin.


11:39

Plans and Status I

DM Science

DAX

DRP

NCSA

SQuARE

>>> import random

>>> presenters = ['Leanne', 'Cristian', 'Frossie', 'KT', 'Fritz', 'Stephen', 'Ian', 'Yusra']

>>> random.shuffle(presenters)

>>> presenters
['Leanne', 'Fritz', 'Yusra', 'Stephen', 'Frossie', 'KT', 'Cristian', 'Ian']


Leanne: Calibpalooza August 25/26



Yusra did  say Gen3 once (just to note she did not).


KT: is TM Gateway a new thing or one of the previous incarnations? It writes to Sasquatch at USDF, then says which topics need to be replicated back to the summit (SQR-068). KT happy with this.

Demo of the cached runnable parameterized notebooks (Times Square)

12:30

Moderator: Wil O'Mullane 

Notetaker: 

1:00


Arch (.key)

IT-Devops

AP


WOM: MR mentioned the UWS-on-TTS problem.
1:45 Wrap up. Next vF2F Oct 18-19 (possibly 20)

On cutting items at time (or a little over):

(IS) should keep a buffer at the end of each session to allow run-over

(FE) OK as long as we schedule a follow-up


Day 3, Thursday - Open for spill over topics

Time | (Project) Topic | Coordinator | Pre-meeting notes | Running notes

Moderator:

Notetaker:

09:00













12:00Close




Originally Proposed Topics

Analysis_DRP and Faro
I would like to understand the state of convergence here and how to get Faro into mainstream development.
Antu - commissioning (30 min - 1 hr)
  • Cristián Silva, Robert Lupton: What do we need for commissioning, and when to order it?
    • what needs to be installed on a cluster at the base or summit
    • do we need processing at the base, or should we rely on USDF?
    • how does the camera diagnostic cluster fit in?

Outside DM there are people we should probably bring into this discussion (Tony, Patrick?)

High-visibility roles for senior team members30 min open + 30 closed, ideallyWe have a number of people (at least in Science Pipelines) who have been on the team for a long time and contribute a lot, but don't regularly do the kind of work that leads to visibility outside DM or even outside their own DM team, and in some cases they are feeling a bit left behind.  I would like to discuss ideas for recognizing these team members and improving their external visibility, and especially get feedback on some ideas from those on the DMLT who have a lot more experience managing people than I do.  My main idea is to define some new roles with real responsibility over technical aspects of DM that are both of interest to the person taking the role and genuinely in need of more attention or coordination.  I would like to have both an open session to discuss this in the abstract and a closed session to talk about an initial batch concrete roles for specific people.
Metric persistence in the Butler (30 mins - 1 hr)

Tests using DC 0.2 to test loading and querying thousands of metrics persisted as lsst.verify.Measurement objects (DM-34556, "Use DC 0.2 to test loading large numbers of metrics", In Progress) show ~0.3 to 0.5 sec per lsst.verify.Measurement object. This is not tenable for science verification work. I understand there are plans to use JSON instead of YAML as the storage format to address this? What are the concrete plans and timeline for implementation?

Need to shift DMLT (5 min)

https://doodle.com/poll/cd37rsi7ghbnx2ce?utm_source=poll&utm_medium=link

LSST Europe is 24-28 October (currently our DMLT vF2F) : https://sites.google.com/inaf.it/lssteurope4

Suggest we shift to Oct 18

PPDB, APDB, User databases - status update (30 mins - 1 hr). What is the status of the choice of technology for the PPDB and APDB user databases?
Status of Campaign Management planning (30-60 mins)

Eric Charles  has been talking to the key stakeholders and can report on his plan and how it might be implemented.

User Batch (30 min)

DMTN-223  has open issues/questions


Attached Documents


Action Item Summary

Description | Due date | Assignee | Task appears on
  • Frossie Economou will recommend additional Level 3 milestones for implementation beyond just the DAX-9 Butler provenance milestone.
15 Mar 2022 | Frossie Economou | DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17
  • Kian-Tat Lim Convene a meeting with Colin, Tim, Robert, Yusra to resolve graph generation with per-dataset quantities (likely based on Consolidated DB work).  
18 Mar 2022 | Kian-Tat Lim | DM Leadership Team Virtual Face-to-Face Meeting, 2022-02-15 to 17
  • Frossie Economou Write an initial draft in the Dev Guide for what "best effort" support means  
17 Nov 2023 | Frossie Economou | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24
  • Convene a group to redo the T-12 month DRP diagram and define scope expectations Yusra AlSayyad 
30 Nov 2023 | Yusra AlSayyad | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24
11 Dec 2023 | Gregory Dubois-Felsmann | DM Leadership Team Virtual Face-to-Face Meeting - 2023-Oct-24