Logistics

Date 

2021 October 19–21

Location

This meeting will be virtual.

Join Zoom Meeting

https://noirlab-edu.zoom.us/j/98768200109?pwd=cDVWTzJCVzFuaDRBU0tqb0hEeEFCdz09

Meeting ID: 987 6820 0109
Password: 161803




Day 1, Tuesday October 19 2021

Time | (Project) Topic | Coordinator | Pre-meeting notes | Running notes

Moderator: Leanne Guy 

Notetaker: Simon Krughoff 

09:00 Welcome



09:15 Project news and updates
  • Frossie: (re: Chuck's effort to identify missing scope) we don't have spare people at the moment
  • Wil: we have several potential pools of effort: in-kind, pre-ops, etc.
  • Frossie: just trying to manage expectations, because prioritizing one thing means de-prioritizing another
  • Robert: some of these things will preclude us going on sky, so these hard prioritizations need to be done
  • Wil: a sticky issue is that the project was scoped for nominal operations, but commissioning is not nominal operations, so scope was definitely missed in tooling in that area


  • Leanne: we have requirements on the base.  Does handing off the base mean we have to do a test campaign to verify? Requirements in DMSR and in the ICD documents. 
  • Wil: We have already accepted the base, but we should go through any requirements and verify we have met them


  • Tim: Have we got a plan to offboard people from GitHub etc.?
  • Wil: Update of the DM team on a regular basis?
  • K-T: Working on a one-time script, but that's not a plan.
  • Wil: We don't always want to remove people when they go off project.  Sometimes they are willing to keep contributing
  • Tim: We could do what Eric suggested and just do a yearly walkthrough
  • Frossie: Kick people off the org and convert them to outside collaborators if they are going to continue contributing.  This should be left up to the T-CAM
  • Wil: Do we want to capture this on the offboarding list as a check box?
  • Frossie: Sure
  • Wil O'Mullane will ask for a new entry in the offboarding form to make sure people are removed or moved to outside-collaborator status in GitHub when they leave  
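
As an illustration of the agreed process, the yearly walkthrough could be scripted against GitHub's REST API (remove org membership, then re-add continuing contributors as outside collaborators on specific repos). This sketch only plans the HTTP calls rather than executing them, so a T-CAM can review the list first; the org and repo names in the example are placeholders.

```python
def offboarding_calls(org, username, keep_repos=()):
    """Plan the GitHub REST API calls to offboard a user.

    Removing org membership drops their access everywhere; re-adding
    them as a collaborator on a repo makes them an outside collaborator
    there, so they can keep contributing.
    """
    calls = [("DELETE", f"/orgs/{org}/members/{username}")]
    for repo in keep_repos:
        calls.append(("PUT", f"/repos/{org}/{repo}/collaborators/{username}"))
    return calls
```

A dry run for a departing developer who will keep contributing to one repo would then produce one DELETE and one PUT to review before anything is executed.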
09:30 Low hanging fruit milestones and how can we claim them 

We have a bunch of milestones that could be closed with a bit of work, but that work keeps getting put off, or the formal testing has not been completed. Let's go through them and come up with a plan. 


Wil's google sheet for milestones

  • Robert: But what if these milestones don't actually reflect what work we actually need to do
  • Frossie: There may be a couple of those, but by and large that doesn't seem to be the case with the current set of milestones
  • Michelle: Can we try to organize the completion process a little?  E.g. put them all in a spreadsheet and identify next steps and responsible teams so that we can get a better handle on how to make progress
  • Wil: I have a sheet like that, though there are probably new ones.  I can resurrect that document
  • Gregory: Looking at overdue ones in my areas, they are all in active development.
  • Frossie: Those are tractable.  LDM-503-EFDc is a case where there is no one team who can retire it on their own.
  • Gregory: Exactly.  Maybe we should look for ones that haven't been started and that we may not even have architecture for


  • Fritz: My work is under represented by milestones relative to work left.
  • Frossie: I'm worried about user databases which does have a milestone and is a problem since it's one of these that takes many teams
  • Frossie: Maybe we should have each T-CAM classify their late milestones into: not relevant, mostly done, need to be moved to another location, etc.
  • Frossie Economou with T-CAMs will do some taxonomy to try to categorize late and soon to be due milestones  
  • Wil O'Mullane will update the spreadsheet to remove done ones and add new ones  
  • Yusra: Most milestones can be completed internally.  Exceptions are AP-15 and DRP-24.  We can't do anything more without precursor data
  • Fritz: We did a big walkthrough and filed LCRs and moved the needle.  Maybe we just need another one of those at the next F2F
  • Wil: We could do that
  • Gregory: Is there low hanging fruit?
  • Yusra: The ones in orange in Wil's spreadsheet are first guesses at the lowest hanging fruit
  • Frossie Economou will run a milestone "parade" for a time box starting 09:00 project on Thursday  

The output of the parade should be a fleshed out version of the spreadsheet.

10:05 Data IDs


  • Frossie: There is a way to uniquely identify a piece of data.  We need a way to hand these things around through services and processing etc.
  • Tim: If it's just coming from a service, we can just pass around opaque UUIDs.  These UUIDs are specific to a butler.
  • Jim: If you are getting them from the database, then UUIDs are the thing you want.  If the user needs to generate them, UUIDs are not going to work
  • Tim: We can't put the UUID in the metadata at put time because we don't know it and not all data types have a sense of metadata
  • Gregory: Service should translate obscreatorid (our UUID) to datasettype, collection and dataId and then pass that on to the processing
  • Tim: We could write a task that takes UUIDs and then does the query to unpack them and continue on
  • Simon: If the service is responsible for unpacking the UUIDs, it will also have to know which pipeline task to execute for each dataset type.
  • Jim: We can do a lot with dimensions we have, but we could add new ones through a schema migration.  It's painful, but if necessary, we can do it.
  • Frossie: As SQuaRE can I just care about UUIDs?
  • Gregory: No.  Your service will need to know how to turn UUIDs into DataRefs.
  • Tim: getDataset is what you want
  • Frossie Economou will schedule a focused follow on meeting to discuss data identifiers with the relevant parties: e.g. Simon, Tim, Gregory, Yusra, Jim  
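
As a rough illustration of the flow Gregory describes, a service could resolve opaque UUIDs to (dataset type, collection, dataId) triplets before handing off to processing. The sketch below uses a plain dict as a stand-in for the Butler registry; the function names are illustrative only, not the actual Butler API.

```python
import uuid

# Stand-in for the Butler registry: maps an opaque dataset UUID to the
# (dataset type, collection, dataId) triplet that processing needs.
_REGISTRY = {}

def register(dataset_type, collection, data_id):
    """Record a dataset at put time and return its opaque identifier."""
    did = uuid.uuid4()
    _REGISTRY[did] = (dataset_type, collection, dict(data_id))
    return did

def resolve(did):
    """What a service must do before launching processing: unpack an
    opaque UUID back into something a pipeline task can use."""
    try:
        return _REGISTRY[did]
    except KeyError:
        raise LookupError(f"unknown dataset id {did}") from None
```

The point of the round trip is that only the registry side ever needs to understand the dataId structure; everything in between can treat the UUID as an opaque token, which is what makes it safe to pass around through services.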
10:30 Break

Moderator: Wil O'Mullane 

Notetaker: Cristián Silva 

11:00 Verification plan through end of construction

Picking up what I should have presented at an earlier F2F meeting (but got ill) 


slides

Leanne: Presentation

Acceptance Tests
Robert: AuxTel data campaign to be included in acceptance tests.
Leanne: Agreed
KT: Who runs the acceptance tests? Who organizes them? Are we using things already done, or is it new work?
Leanne: Organized by Jeff Carlin and Leanne; will need help from product owners. Should be able to execute unless the product owner wants to do it.
Frossie: Retiring pre-ComCam data should be fine. Retiring L2 when level 3 tests are done is also good. 
Fritz: For databases there are scale requirements related to datasets, which are not the same as in operations. 
Leanne: Run tests on the datasets available; could stay in verification status until LSSTCam data is available. 
Robert: Regarding KT's question, Robert can get SitCom scientists who would like to be involved in acceptance tests. 
Frossie: Can't wait for DR1, so performance requirements must be done "at scale". We could use an artificial load. 
Fritz: Some other things may not appear until we get data production at scale. 
Frossie: Could fulfill level 3s while level 2 activities are ongoing. 
Wil: 1a, 1b are camera-focused. But we do need to prove at scale. 

Ops Rehearsals
Frossie: A different ops rehearsal than in the past. Commissioning style, to find out whether how we do things is wrong.
Robert: Similar to what we do
Frossie: People involved now are experienced w/AuxTel. ComCam is something new. 
Leanne: More about actors and interaction than components. Should these be DM ops rehearsals, or Rubin's?
Wil: Would like them to be Rubin's. The next OR should focus on commissioning.
Robert: Could be good to do a "real" OR with more/new people involved
KT: AuxTel is not using final components, so training to do things this way could be a problem later.
Robert: AuxTel is useful and is good for discovering what's missing
Wil: We shouldn't hand over things that are not ready, e.g. the API to Alysha
Leanne: Like the idea of a Rubin ops rehearsal 
Wil: We shouldn't push everything; there are some DM-only activities. 

Network
KT: Can we do it now?
Cristián: I'd rather wait, so as not to do the work twice. 
Leanne: We can wait. 
Cristián: If this is taking too much time, we can still do it. 
Robert: Does base facilities verification include running pipelines on antu?
Wil: Not in scope, but we can do it. 
KT: Base is about facilities, not the services
Wil: Base facilities were handed over to operations (NOIRLab).

Middleware
Frossie: Worried about Jim as middleware product owner. It could be too much load for Jim. 
Jim: Already doing some of this. 
Gregory: Backing up Jim. 
Wil: Need to update the org chart?
Leanne: Already started updating.
Wil: While on org charts, the product owner of the LHN should be moved to Richard. 

RSP Acceptance Tests
Gregory: Running test campaigns; they are good because we always find something. 

DM Science Validation
Wil: Verification on DM side, validation in conjunction with the rest. 

Sizing:
Robert: Sizing means memory, CPU, etc. 
Gregory: The release field is a non-trivial problem.
Wil: Commissioning could be a continuous release process, with one final release for operations 
KT: Concerns about buying hardware for the USDF given the lead times and timeline. 
Richard: Hardware ordered; perhaps it arrives in January
KT: Data release: if you need to patch, it is still the same data release because it only replaces code. 
Wil: Not a problem for commissioning. 
Jim: During commissioning it shouldn't be a problem.
Wil: Number of IDs in a patch: can it be split for ID purposes? Make smaller patches...
Wil: The science team can investigate. 
Wil: DMTN-135 has good information about hardware

12:10 Sizing (Kian-Tat Lim) slides





12:30 Break

Moderator: Wil O'Mullane 

Notetaker: Ian Sullivan 

1:00

Consolidated DB or equivalent

FE: A number of issues getting us stuck:

  • If someone asks for "all the observations taken by AuxTel last night" there is no way to get that to them
  • We have ~4 different ways of representing this data
  • We could take advantage of the butler registry, which has all of the information. Would have to expose it from the butler as a way to get to the tables and views that we need to build services that query the DB.
  • Can we expose the butler registry in this way?
  • Who is leading this effort? There is no general view into the metadata
  • JB: I don't actually hate this, the big thing that has changed is the Butler registry schema
  • RHL: I have exactly the opposite reaction. We have an enormous amount of things that must be captured here. Mixing that up with an operational database that has to run the system seems like a mistake
    • KTL: This has to be an operational database. We need something that includes both metadata that we know at the time of observation, as well as information that is calculated later.
  • FM: Frossie, how much of your concern is addressed by metadata tables that are already in the DPDD in combination with reformatted EFD?
    • KTL: The DPDD does not have any observational metadata (the tables you are thinking of are in the baseline schema)
  • WOM: We had a long data
  • GPDF: “Consolidated Database” was also in contraposition to Qserv
  • JB: I don't want to put much more into Butler code. It should either be completely separate, or written as an extension
    • FE: What I want to do is seed the views that I want. Can we use the butler registry for that?
  • TJ: I am supposed to write a technote on observation annotations, i.e. to let an observer flag the nights observations as bad
  • GPDF: I imagined that we would take advantage of the Butler's existing tables, but not mess with the butler itself. It could be read-only, and we would do views instead of joins.
  • GPDF: What we're trying to do is allow this be done in the live butler, not in the replica after a delay. This would allow an observer to enter annotations immediately instead of first having to wait for the Butler to update
  • RHL: I worry that we are focusing on the technical implementation, rather than defining what we need first
  • FE: We all agree that we need an exposure log. Right now there is neither a technical path nor a management path unless we do something here
  • FE: It is a DM requirement to deliver an exposure table. Who is doing this?
  • KTL: That's the big problem. We know the data is coming, we just need a place to hold it.
  • WOM: We need to be careful to say that we would be making use of the butler, not that it would be a part of the butler.
  • JB: There are different levels of interface. If we want these things to be joined against butler tables, we need to be careful. It is not easy to do that, but it can be done.
  • TJ: I am only writing up the specific case of how you write up annotations, not how you deal with tracking
  • RHL: I think it is SITCOM's responsibility to lead this, and DM to build the backend according to their directions
  • KT: SITCOM can be product owners and work on front ends, but we need at least a prototype backend to work with them on. If no one wants to take this on, I can do it.
  • GPDF: The image metadata table issue is something we discussed as part of the image services, and is something I am already working on. I feel a lot of responsibility for the backend, though I can't do the front end.
  • RHL: I can provide a prototype of a backend from an obs log from HSC. I am happy to work with Architecture 
  • JB: I'd like to weigh in on how this would interact with the Butler. Is there more than what is in the baseline schema?
    • WOM: No, that's the problem, we need the product owners to tell us what is missing
  • TJ: It is trivial to query the butler for all observations that were taken last night, but impossible to tell what the last observation was since there is no ordering
  • FE: Where does this live?
    • TJ: OODS
  • WOM: Arch will take care of this for construction, either K-T or Tim
  • FE: When does the data show up?
    • KT: during AP, so within 60s
    • FE: That is acceptable
  • GPDF: Which TCAM am I working with, and who is actually building something?
  • FE: Propose that we have the basic architecture for this presented at the next DMLT F2F.
    • KT: I think a prototype should be done by then, and I can make it. You will all hate it, but it will allow us to have a discussion
  • Kian-Tat Lim will write up a prototype consolidated database  
  • KSK: where will the proof of concept be hosted?
    • KT: the IDF on Kubernetes, unless the USDF is up in time
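
To make K-T's prototype idea concrete, one minimal shape for the consolidated DB is a read-only view joining registry-style exposure metadata with later-arriving observer annotations, which also answers Frossie's "all observations taken last night" query and TJ's annotation use case. This is only a sketch with made-up table and column names, using SQLite as a stand-in for the real backend.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Exposure metadata known at observation time (a stand-in for the
    -- Butler registry's exposure table; names are illustrative only).
    CREATE TABLE exposure (
        exposure_id INTEGER PRIMARY KEY,
        instrument  TEXT,
        day_obs     TEXT
    );
    -- Observer annotations added later, kept outside the registry itself.
    CREATE TABLE annotation (
        exposure_id INTEGER REFERENCES exposure,
        flag        TEXT
    );
    -- The "consolidated" read-only face: a view over both, so services
    -- query one place without writing into registry tables.
    CREATE VIEW obs_log AS
        SELECT e.exposure_id, e.instrument, e.day_obs, a.flag
        FROM exposure e LEFT JOIN annotation a USING (exposure_id);
""")
con.executemany("INSERT INTO exposure VALUES (?, ?, ?)",
                [(1, "LATISS", "2021-10-18"), (2, "LATISS", "2021-10-19")])
con.execute("INSERT INTO annotation VALUES (2, 'bad')")

# "All the observations taken last night", with any flags attached.
rows = con.execute(
    "SELECT exposure_id, flag FROM obs_log WHERE day_obs = '2021-10-19'"
).fetchall()
```

The view keeps the registry tables untouched (GPDF's "read-only, views instead of joins" point) while annotations land in a separate table that observers can write to immediately.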





14:00 Close

Day 2, Wednesday October 20 2021

Moderator: Fritz Mueller 

 Notetaker: Kian-Tat Lim 
9:00 Minor admin

Start time tomorrow

Order of Status later


9:00

Requirements on user batch

DM Science will present requirements on user batch. dmtn-202.lsst.io
GPDF produced a Confluence page with requirements relevant to User-Generated Data Products and computing available to science users to produce them, but it only addressed high-level capabilities.
User Batch has to address products derived from both catalogs and images (not all images: selected subset based on catalogs).
10% of capacity required for survey will be provided to users (as $, not necessarily same mix as production); much smaller than e.g. DESC needs.
  • Frossie: Does 10% include nublado? If so, then may not be much leftover for batch; if not, then nublado is uncosted increase
  • Batch will still be allocated (and thus maybe can be harder to use); can take into account whether users will make results public
  • Nublado has somewhat absorbed original birthright concept
Need to provide a processing framework for systematic runs over appropriate data.
Quotas need to be able to go to groups as well as users.

Catalog use cases include training classifiers running on sharded Parquet.
Next-to-data is therefore not really DAX/Qserv-related anymore.
Dask in nublado works well; others like UW/LINCC are using Spark.
Data Science community constantly producing new tools.
(Could perhaps leverage LINCC work, but they may not be scalable enough for us.)

Frossie: need to answer these:
  •  What resources are devoted to nublado vs. batch (and maybe vs. Dask)?
    •  Hard to imagine fulfilling computing reserve with nublado; can't devote entire 10% to it
  •  Specific asks from specific teams to build this system?

Tossing a SLURM queue at users does not meet requirements, but can be part of the solution.
BPS + batch queues is designed around single-tenancy; needs work for multi-user.
But may be generic enough for user processing.

Richard: 10% is ~500 cores, not much; users need unstructured compute.

Frossie: Project wants to reach out to a wide variety of users; can't have both birthright and batch.  The difference is that batch utilization is by policy, nublado utilization is by user. 

Tim: There is a chasm between PipelineTask with Butler and BPS vs. running arbitrary jobs.  Isn't this why users on Google is easier, because they *can* pay with research grant money? So we are saying that arbitrary compute is not what we are doing, and user batch is only supported via BPS/Butler.
Wil: need to sit down with Richard and figure out whether 10% needs to increase and how to divide it; his assumption has always been that it is PipelineTask and not general; also need to add priorities and timelines for when users get what.  Need to provide lots of alternatives to users.
KTL:

  • Do we need to divide up front?
    • No, can be elastic.
  • Can VO services provide interface for arbitrary jobs?
    • Not likely to be sufficiently scalable; don't want 100K jobs hitting VO services
    • Although not clear if users will have 100K jobs...
  • Don't yet know if we can hook up Google properly for bring-your-own — need a demonstration.

GPDF: don't need a steering framework for "freeform compute" but do need data access, and VO is not specified for this load; cannot force everyone into PipelineTask framework.
Frossie: we have to force everyone into PipelineTask framework.
Colin: what pieces of software will people use and how will they run them? Detailing use cases will help.
GPDF: We could provide a command-line tool that a) extracts a file, based on a (collection, dataid, type) triplet, to whatever scratch space is available to batch jobs, and/or b) extract a signed object-store URL that they can use in whatever code they have.
Tim: This is butler retrieve-artifacts 
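
A sketch of the command-line interface GPDF proposes is below; the tool name and options are hypothetical (in practice it would likely wrap butler retrieve-artifacts), but it shows the (collection, dataset type, dataId) triplet plus the two output modes discussed: copy to scratch space, or emit a signed object-store URL.

```python
import argparse

def make_parser():
    # Hypothetical CLI shape for fetching one dataset by its
    # (collection, dataset type, dataId) triplet.
    p = argparse.ArgumentParser(prog="fetch-dataset")
    p.add_argument("collection")
    p.add_argument("dataset_type")
    p.add_argument("data_id", help="key=value pairs, comma separated")
    p.add_argument("--dest", default=".",
                   help="scratch space visible to the batch job")
    p.add_argument("--signed-url", action="store_true",
                   help="print a signed object-store URL instead of copying")
    return p
```

A batch job would then call something like fetch-dataset u/someone/coadds deepCoadd tract=9813,patch=22 --signed-url and feed the URL to whatever code it has, without needing the PipelineTask framework.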

Wil actions:

  • Tidy up description of percentages in DMTN-135 to make clear that 10% includes all user computing Wil O'Mullane
     
  • Address increases in compute allocation to users if needed Wil O'Mullane
     
  • Respond to Gregory's DMTN-202 Wil O'Mullane 
9:30 RSP policy issues: User data guarantees, backwards compatibility promises, hybrid model impacts etc.

User data guarantees

User data locations:

  • POSIX (home/project) filesystem
    • Will have backups, but no restrictions up to quota
  • Butler collection
    • How does user data move from one DR to another
      • Users should have to move it themselves
    • Can break Butler compatibility between DRs if necessary
  • User databases
    • Similar to Butler for DRs

Access to data is better-preserved with schema migration and backward-compatible new software.
Reproducing results is better-preserved with old Registry and software.

10:00 Terminology: What else should we change while changing the default branch?

"Master calibration" is a standard well-understood term.

Try to coordinate with other observatories in choosing a new term.

  • Frossie Economou will reach out to Dara Norman about a replacement term for "master calibration"  

Some "blacklist/whitelist" usage fixed in Qserv docs/comments.

Could have a hackathon for this?  Consider for next PCW.

10:15 Focus Friday (Poll)

Requests for "Focus Friday" exceptions by developers in AP have generally been addressed by "save for later" or other tools.
Should consider using scheduled messages every day to deal with timezones, not just Fridays.
But some people can/will read and respond even outside normal work hours.
Need to better understand culture issues in general; perhaps do a similar survey on a different issue for each meeting?

RHL: What will we do about the lack of documentation?
Ian: Scheduling work for people to add documentation.
Writing answers provided on Fridays directly into documentation helps.

Wil: everyone likes not having meetings; some relaxation of Slack rules might be considered; not an overwhelming push to change things

RHL: Future surveys should get better coverage from non-DM people.

10:30 Break

Moderator: Wil O'Mullane 

11:00


Premortem


Entries here 


11:30 Status part I
  • Leanne - Science
  • Cristián - IT/LHN
  • Frossie - SQuaRE/Sci Plat
  • KT - Arch
  • Fritz - DAX
  • Michelle - NCSA

La Serena -29.91, -71.24

Tucson 32.20, -110.96

Gibraltar/Malaga 36.14, -5.35

SF  37.76, -122.43

Oakland 37.80, -122.22

Urbana 40.11,-88.199

Princeton 40.34, -74.68

Seattle 47.65, -122.30

12:30

Break

Moderator: Robert Lupton Notetaker:

1:00


Status part II


1:45 Wrap up 

DMLT: T/W/Þ  2022-02-15 in Tucson; virtual  2022-06-14; 2022-10-18

Day 3, Thursday October 21 2021

Time | (Project) Topic | Coordinator | Pre-meeting notes | Running notes

Moderator:

Notetaker:

Wil O'Mullane 
09:00 Milestone Parade

https://docs.google.com/spreadsheets/d/1TUIUf84qHX5QfcCNWs27HGKlgHmCKcCs1IpBP5ODNDA/edit#gid=0


10:30 Close




Proposed Topics

Topic | Requested by | Time required (estimate) | Notes
Requirements on user batch | | 30 | DM Science will present requirements on user batch. dmtn-202.lsst.io
How can we finalise Data IDs? | Frossie Economou | 30 | One of the biggest obstacles we have to putting VO services into (Rubin data) production is the fact that we do not have an agreed format for Data IDs uniquely identifying our data products. This information exists in a dict, but we don't have a scheme for converting it to a string. Let's discuss the complications and come up with a plan.
Low hanging fruit milestones and how can we claim them | Frossie Economou | 60 | We have a bunch of milestones that can be closed with a bit of work that keeps getting put off or for lack of the formal testing being completed. Let's go through them and come up with a plan.
Consolidated DB or equivalent | Frossie Economou | 45 | We need a plan on how to get observation metadata somewhere where VO services can get them - this means probably the Consolidated DB (though, crazy idea, the Butler registry seems to know most of this stuff?). Right now nobody seems to own this and be working it, so we have to come up with an actionable plan. KTL: Isn't this DM-30853?
RSP policy issues | Frossie Economou | 30 | User data guarantees, backwards compatibility promises, hybrid model impacts etc.
Verification plan through end of construction | | 60 | Picking up what I should have presented at an earlier F2F meeting (but got ill)
Focus Friday | Ian Sullivan | 15 | I am getting more feedback from developers getting frustrated by some aspects of Focus Friday. Note that these concerns are mostly addressed by the open support channels and by instructing them on how to use Slack's "Schedule for later" feature.
Sizing | | 15 | What changes in data products or pipeline step complexity or memory usage are known or anticipated? What is the plan for DM-22082?
Terminology | | 15 | What else should we change while changing the default branch?


Attached Documents


Action Item Summary