Logistics

Date 

2021 October 19–21

Location

This meeting will be virtual.

Join Zoom Meeting

https://noirlab-edu.zoom.us/j/98768200109?pwd=cDVWTzJCVzFuaDRBU0tqb0hEeEFCdz09

Meeting ID: 987 6820 0109
Password: 161803




Day 1, Tuesday October 19 2021

Time | (Project) Topic | Coordinator | Pre-meeting notes | Running notes

Moderator: Leanne Guy 

Notetaker: Simon Krughoff 

09:00 Welcome



09:15 Project news and updates
  • Frossie: (re: Chuck's effort to identify missing scope) we don't have spare people at the moment
  • Wil: we have several potential pools of effort: in-kind, pre-ops, etc.
  • Frossie: just trying to manage expectations, because prioritizing one thing means de-prioritizing another
  • Robert: some of these things will preclude us going on sky, so these hard prioritizations need to be done
  • Wil: a sticky issue is that the project was scoped for nominal operations, but commissioning is not nominal operations, so scope was definitely missed in tooling in that area


  • Leanne: we have requirements on the base.  Does handing off the base mean we have to do a test campaign to verify? Requirements in DMSR and in the ICD documents. 
  • Wil: We have already accepted the base, but we should go through any requirements and verify we have met them


  • Tim: Have we got a plan to offboard people from GitHub etc.?
  • Wil: Update of the DM team on a regular basis?
  • K-T: Working on a one-time script, but that's not a plan.
  • Wil: We don't always want to remove people when they go off project.  Sometimes they are willing to keep contributing
  • Tim: We could do what Eric suggested and just do a yearly walkthrough
  • Frossie: Kick people off the org and convert them to outside collaborators if they are going to continue contributing.  This should be left up to the T-CAM
  • Wil: Do we want to capture this on the offboarding list as a check box?
  • Frossie: Sure
  • Wil O'Mullane will ask for a new entry in the offboarding form to make sure people are removed or moved to outside-collaborator status in GitHub when they leave  
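
As an illustration of the agreed process, the yearly walkthrough could be scripted against GitHub's REST API (remove org membership, then re-add continuing contributors as outside collaborators on specific repos). This sketch only plans the HTTP calls rather than executing them, so a T-CAM can review the list first; the org and repo names in the example are placeholders.

```python
def offboarding_calls(org, username, keep_repos=()):
    """Plan the GitHub REST API calls to offboard a user.

    Removing org membership drops their access everywhere; re-adding
    them as a collaborator on a repo makes them an outside collaborator
    there, so they can keep contributing.
    """
    calls = [("DELETE", f"/orgs/{org}/members/{username}")]
    for repo in keep_repos:
        calls.append(("PUT", f"/repos/{org}/{repo}/collaborators/{username}"))
    return calls
```

A dry run for a departing developer who will keep contributing to one repo would then produce one DELETE and one PUT to review before anything is executed.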
09:30 Low hanging fruit milestones and how can we claim them 

We have a bunch of milestones that could be closed with a bit of work, but that work keeps getting put off, or the formal testing has not been completed. Let's go through them and come up with a plan. 


Wil's google sheet for milestones

  • Robert: But what if these milestones don't actually reflect what work we actually need to do
  • Frossie: There may be a couple of those, but by and large that doesn't seem to be the case with the current set of milestones
  • Michelle: Can we try to organize the completion process a little?  E.g. put them all in a spreadsheet and identify next steps and responsible teams so that we can get a better handle on how to make progress
  • Wil: I have a sheet like that, though there are probably new ones.  I can resurrect that document
  • Gregory: Looking at overdue ones in my areas, they are all in active development.
  • Frossie: Those are tractable.  LDM-503-EFDc is a case where there is no one team who can retire it on their own.
  • Gregory: Exactly.  Maybe we should look for ones that haven't been started and that we may not even have architecture for


  • Fritz: My work is under represented by milestones relative to work left.
  • Frossie: I'm worried about user databases which does have a milestone and is a problem since it's one of these that takes many teams
  • Frossie: Maybe we should have each T-CAM classify their late milestones into: not relevant, mostly done, need to be moved to another location, etc.
  • Frossie Economou with T-CAMs will do some taxonomy to try to categorize late and soon to be due milestones  
  • Wil O'Mullane will update the spreadsheet to remove done ones and add new ones  
  • Yusra: Most milestones can be completed internally.  Exceptions are AP-15 and DRP-24.  We can't do anything more without precursor data
  • Fritz: We did a big walkthrough and filed LCRs and moved the needle.  Maybe we just need another one of those at the next F2F
  • Wil: We could do that
  • Gregory: Is there low hanging fruit?
  • Yusra: The ones in orange in Wil's spreadsheet are first guesses at the lowest hanging fruit
  • Frossie Economou will run a milestone "parade" for a time box starting 09:00 project on Thursday  

The output of the parade should be a fleshed out version of the spreadsheet.

10:05 Data IDs


  • Frossie: There is a way to uniquely identify a piece of data.  We need a way to hand these things around through services and processing etc.
  • Tim: If it's just coming from a service, we can just pass around opaque UUIDs.  These UUIDs are specific to a butler.
  • Jim: If you are getting them from the database, then UUIDs are the thing you want.  If the user needs to generate them, UUIDs are not going to work
  • Tim: We can't put the UUID in the metadata at put time because we don't know it and not all data types have a sense of metadata
  • Gregory: Service should translate obscreatorid (our UUID) to datasettype, collection and dataId and then pass that on to the processing
  • Tim: We could write a task that takes UUIDs and then does the query to unpack them and continue on
  • Simon: If the service is responsible for unpacking the UUIDs, it will also have to know which pipeline task to execute for each dataset type.
  • Jim: We can do a lot with dimensions we have, but we could add new ones through a schema migration.  It's painful, but if necessary, we can do it.
  • Frossie: As SQuaRE can I just care about UUIDs?
  • Gregory: No.  Your service will need to know how to turn UUIDs into DataRefs.
  • Tim: getDataset is what you want
  • Frossie Economou will schedule a focused follow on meeting to discuss data identifiers with the relevant parties: e.g. Simon, Tim, Gregory, Yusra, Jim  
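
As a rough illustration of the flow Gregory describes, a service could resolve opaque UUIDs to (dataset type, collection, dataId) triplets before handing off to processing. The sketch below uses a plain dict as a stand-in for the Butler registry; the function names are illustrative only, not the actual Butler API.

```python
import uuid

# Stand-in for the Butler registry: maps an opaque dataset UUID to the
# (dataset type, collection, dataId) triplet that processing needs.
_REGISTRY = {}

def register(dataset_type, collection, data_id):
    """Record a dataset at put time and return its opaque identifier."""
    did = uuid.uuid4()
    _REGISTRY[did] = (dataset_type, collection, dict(data_id))
    return did

def resolve(did):
    """What a service must do before launching processing: unpack an
    opaque UUID back into something a pipeline task can use."""
    try:
        return _REGISTRY[did]
    except KeyError:
        raise LookupError(f"unknown dataset id {did}") from None
```

The point of the round trip is that only the registry side ever needs to understand the dataId structure; everything in between can treat the UUID as an opaque token, which is what makes it safe to pass around through services.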
10:30 Break

Moderator: Wil O'Mullane 

Notetaker: Cristián Silva 

11:00 Verification plan through end of construction

Picking up what I should have presented at an earlier F2F meeting (but got ill) 


slides

Leanne: Presentation

Acceptance Tests
Robert: AuxTel data campaign to be included in acceptance tests.
Leanne: Agreed
KT: Who runs the acceptance tests? Who organizes them? Are we using things already done, or is it new work?
Leanne: Organized by Jeff Carlin and Leanne; will need help from product owners. Should be able to execute unless the product owner wants to do it.
Frossie: Retiring pre-ComCam data should be fine. Retiring L2 when level 3 tests are done is also good. 
Fritz: For databases there are scale requirements related to datasets, which are not the same as in operations. 
Leanne: Run tests on the datasets available; could stay in verification status until LSSTCam data is available. 
Robert: Regarding KT's question, Robert can get SitCom scientists who would like to be involved in acceptance tests. 
Frossie: Can't wait for DR1, so performance requirements must be done "at scale". We could use an artificial load. 
Fritz: Some other things may not appear until we get data production at scale. 
Frossie: Could fulfill level 3s while level 2 activities are ongoing. 
Wil: 1a, 1b are camera-focused. But we do need to prove at scale. 

Ops Rehearsals
Frossie: A different ops rehearsal than in the past. Commissioning style, to find out whether how we do things is wrong.
Robert: Similar to what we do
Frossie: People involved now are experienced w/AuxTel. ComCam is something new. 
Leanne: More about actors and interaction than components. Should these be DM ops rehearsals, or Rubin's?
Wil: Would like them to be Rubin's. The next OR should focus on commissioning.
Robert: Could be good to do a "real" OR with more/new people involved
KT: AuxTel is not using final components, so training to do things this way could be a problem later.
Robert: AuxTel is useful and is good for discovering what's missing
Wil: We shouldn't hand over things that are not ready, e.g. the API to Alysha
Leanne: Like the idea of a Rubin ops rehearsal 
Wil: We shouldn't push everything; there are some DM-only activities. 

Network
KT: Can we do it now?
Cristián: I'd rather wait, so as not to do the work twice. 
Leanne: We can wait. 
Cristián: If this is taking too much time, we can still do it. 
Robert: Does base facilities verification include running pipelines on antu?
Wil: Not in scope, but we can do it. 
KT: Base is about facilities, not the services
Wil: Base facilities were handed over to operations (NOIRLab).

Middleware
Frossie: Worried about Jim as middleware product owner. It could be too much load for Jim. 
Jim: Already doing some of this. 
Gregory: Backing up Jim. 
Wil: Need to update the org chart?
Leanne: Already started updating.
Wil: While on org charts, the product owner of the LHN should be moved to Richard. 

RSP Acceptance Tests
Gregory: Running test campaigns; they are good because we always find something. 

DM Science Validation
Wil: Verification on DM side, validation in conjunction with the rest. 

Sizing:
Robert: Sizing means memory, CPU, etc. 
Gregory: The release field is a non-trivial problem.
Wil: Commissioning could be a continuous release process, with one final release for operations 
KT: Concerns about buying hardware for the USDF given the lead times and timeline. 
Richard: Hardware ordered; perhaps it arrives in January
KT: Data release: if you need to patch, it is still the same data release because it only replaces code. 
Wil: Not a problem for commissioning. 
Jim: During commissioning it shouldn't be a problem.
Wil: Number of IDs in a patch: can it be split for ID purposes? Make smaller patches...
Wil: The science team can investigate. 
Wil: DMTN-135 has good information about hardware

12:10 Sizing (Kian-Tat Lim) slides





12:30 Break

Moderator: Wil O'Mullane 

Notetaker: Ian Sullivan 

1:00

Consolidated DB or equivalent

FE: A number of issues getting us stuck:

  • If someone asks for "all the observations taken by AuxTel last night" there is no way to get that to them
  • We have ~4 different ways of representing this data
  • We could take advantage of the butler registry, which has all of the information. Would have to expose it from the butler as a way to get to the tables and views that we need to build services that query the DB.
  • Can we expose the butler registry in this way?
  • Who is leading this effort? There is no general view into the metadata
  • JB: I don't actually hate this, the big thing that has changed is the Butler registry schema
  • RHL: I have exactly the opposite reaction. We have an enormous amount of things that must be captured here. Mixing that up with an operational database that has to run the system seems like a mistake
    • KTL: This has to be an operational database. We need something that includes both metadata that we know at the time of observation, as well as information that is calculated later.
  • FM: Frossie, how much of your concern is addressed by metadata tables that are already in the DPDD in combination with reformatted EFD?
    • KTL: The DPDD does not have any observational metadata (the tables you are thinking of are in the baseline schema)
  • WOM: We had a long data
  • GPDF: “Consolidated Database” was also in contraposition to Qserv
  • JB: I don't want to put much more into Butler code. It should either be completely separate, or written as an extension
    • FE: What I want to do is seed the views that I want. Can we use the butler registry for that?
  • TJ: I am supposed to write a technote on observation annotations, i.e. to let an observer flag the nights observations as bad
  • GPDF: I imagined that we would take advantage of the Butler's existing tables, but not mess with the butler itself. It could be read-only, and we would do views instead of joins.
  • GPDF: What we're trying to do is allow this be done in the live butler, not in the replica after a delay. This would allow an observer to enter annotations immediately instead of first having to wait for the Butler to update
  • RHL: I worry that we are focusing on the technical implementation, rather than defining what we need first
  • FE: We all agree that we need an exposure log. Right now there is neither a technical path nor a management path unless we do something here
  • FE: It is a DM requirement to deliver an exposure table. Who is doing this?
  • KTL: That's the big problem. We know the data is coming, we just need a place to hold it.
  • WOM: We need to be careful to say that we would be making use of the butler, not that it would be a part of the butler.
  • JB: There are different levels of interface. If we want these things to be joined against butler tables, we need to be careful. It is not easy to do that, but it can be done.
  • TJ: I am only writing up the specific case of how you write up annotations, not how you deal with tracking
  • RHL: I think it is SITCOM's responsibility to lead this, and DM to build the backend according to their directions
  • KT: SITCOM can be product owners and work on front ends, but we need at least a prototype backend to work with them on. If no one wants to take this on, I can do it.
  • GPDF: The image metadata table issue is something we discussed as part of the image services, and is something I am already working on. I feel a lot of responsibility for the backend, though I can't do the front end.
  • RHL: I can provide a prototype of a backend from an obs log from HSC. I am happy to work with Architecture 
  • JB: I'd like to weigh in on how this would interact with the Butler. Is there more than what is in the baseline schema?
    • WOM: No, that's the problem, we need the product owners to tell us what is missing
  • TJ: It is trivial to query the butler for all observations that were taken last night, but impossible to tell what the last observation was since there is no ordering
  • FE: Where does this live?
    • TJ: OODS
  • WOM: Arch will take care of this for construction, either K-T or Tim
  • FE: When does the data show up?
    • KT: during AP, so within 60s
    • FE: That is acceptable
  • GPDF: Which TCAM am I working with, and who is actually building something?
  • FE: Propose that we have the basic architecture for this presented at the next DMLT F2F.
    • KT: I think a prototype should be done by then, and I can make it. You will all hate it, but it will allow us to have a discussion
  • Kian-Tat Lim will write up a prototype consolidated database  
  • KSK: where will the proof of concept be hosted?
    • KT: the IDF on Kubernetes, unless the USDF is up in time
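
To make K-T's prototype idea concrete, one minimal shape for the consolidated DB is a read-only view joining registry-style exposure metadata with later-arriving observer annotations, which also answers Frossie's "all observations taken last night" query and TJ's annotation use case. This is only a sketch with made-up table and column names, using SQLite as a stand-in for the real backend.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Exposure metadata known at observation time (a stand-in for the
    -- Butler registry's exposure table; names are illustrative only).
    CREATE TABLE exposure (
        exposure_id INTEGER PRIMARY KEY,
        instrument  TEXT,
        day_obs     TEXT
    );
    -- Observer annotations added later, kept outside the registry itself.
    CREATE TABLE annotation (
        exposure_id INTEGER REFERENCES exposure,
        flag        TEXT
    );
    -- The "consolidated" read-only face: a view over both, so services
    -- query one place without writing into registry tables.
    CREATE VIEW obs_log AS
        SELECT e.exposure_id, e.instrument, e.day_obs, a.flag
        FROM exposure e LEFT JOIN annotation a USING (exposure_id);
""")
con.executemany("INSERT INTO exposure VALUES (?, ?, ?)",
                [(1, "LATISS", "2021-10-18"), (2, "LATISS", "2021-10-19")])
con.execute("INSERT INTO annotation VALUES (2, 'bad')")

# "All the observations taken last night", with any flags attached.
rows = con.execute(
    "SELECT exposure_id, flag FROM obs_log WHERE day_obs = '2021-10-19'"
).fetchall()
```

The view keeps the registry tables untouched (GPDF's "read-only, views instead of joins" point) while annotations land in a separate table that observers can write to immediately.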





14:00 Close

Day 2, Wednesday October 20 2021

Moderator: Fritz Mueller 

 Notetaker: Kian-Tat Lim 
9:00 Minor admin

Start time tomorrow

Order of Status later


9:00

Requirements on user batch

DM Science will present requirements on user batch. dmtn-202.lsst.io
GPDF produced a Confluence page with requirements relevant to User-Generated Data Products and computing available to science users to produce them, but it only addressed high-level capabilities.
User Batch has to address products derived from both catalogs and images (not all images: selected subset based on catalogs).
10% of capacity required for survey will be provided to users (as $, not necessarily same mix as production); much smaller than e.g. DESC needs.
  • Frossie: Does 10% include nublado? If so, then may not be much leftover for batch; if not, then nublado is uncosted increase
  • Batch will still be allocated (and thus maybe can be harder to use); can take into account whether users will make results public
  • Nublado has somewhat absorbed original birthright concept
Need to provide a processing framework for systematic runs over appropriate data.
Quotas need to be able to go to groups as well as users.

Catalog use cases include training classifiers running on sharded Parquet.
Next-to-data is therefore not really DAX/Qserv-related anymore.
Dask in nublado works well; others like UW/LINCC are using Spark.
Data Science community constantly producing new tools.
(Could perhaps leverage LINCC work, but they may not be scalable enough for us.)

Frossie: need to answer these:
  •  What resources are devoted to nublado vs. batch (and maybe vs. Dask)?
    •  Hard to imagine fulfilling computing reserve with nublado; can't devote entire 10% to it
  •  Specific asks from specific teams to build this system?

Tossing a SLURM queue at users does not meet requirements, but can be part of the solution.
BPS + batch queues is designed around single-tenancy; needs work for multi-user.
But may be generic enough for user processing.

Richard: 10% is ~500 cores, not much; users need unstructured compute.

Frossie: Project wants to reach out to a wide variety of users; can't have both birthright and batch.  The difference is that batch utilization is by policy, nublado utilization is by user. 

Tim: There is a chasm between PipelineTask with Butler and BPS vs. running arbitrary jobs.  Isn't this why users on Google is easier, because they *can* pay with research grant money? So we are saying that arbitrary compute is not what we are doing, and user batch is only supported via BPS/Butler.
Wil: need to sit down with Richard and figure out whether 10% needs to increase and how to divide it; his assumption has always been that it is PipelineTask and not general; also need to add priorities and timelines for when users get what.  Need to provide lots of alternatives to users.
KTL:

  • Do we need to divide up front?
    • No, can be elastic.
  • Can VO services provide interface for arbitrary jobs?
    • Not likely to be sufficiently scalable; don't want 100K jobs hitting VO services
    • Although not clear if users will have 100K jobs...
  • Don't yet know if we can hook up Google properly for bring-your-own — need a demonstration.

GPDF: don't need a steering framework for "freeform compute" but do need data access, and VO is not specified for this load; cannot force everyone into PipelineTask framework.
Frossie: we have to force everyone into PipelineTask framework.
Colin: what pieces of software will people use and how will they run them? Detailing use cases will help.
GPDF: We could provide a command-line tool that a) extracts a file, based on a (collection, dataid, type) triplet, to whatever scratch space is available to batch jobs, and/or b) extract a signed object-store URL that they can use in whatever code they have.
Tim: This is butler retrieve-artifacts 
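
A sketch of the command-line interface GPDF proposes is below; the tool name and options are hypothetical (in practice it would likely wrap butler retrieve-artifacts), but it shows the (collection, dataset type, dataId) triplet plus the two output modes discussed: copy to scratch space, or emit a signed object-store URL.

```python
import argparse

def make_parser():
    # Hypothetical CLI shape for fetching one dataset by its
    # (collection, dataset type, dataId) triplet.
    p = argparse.ArgumentParser(prog="fetch-dataset")
    p.add_argument("collection")
    p.add_argument("dataset_type")
    p.add_argument("data_id", help="key=value pairs, comma separated")
    p.add_argument("--dest", default=".",
                   help="scratch space visible to the batch job")
    p.add_argument("--signed-url", action="store_true",
                   help="print a signed object-store URL instead of copying")
    return p
```

A batch job would then call something like fetch-dataset u/someone/coadds deepCoadd tract=9813,patch=22 --signed-url and feed the URL to whatever code it has, without needing the PipelineTask framework.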

Wil actions:

  • Tidy up description of percentages in DMTN-135 to make clear that 10% includes all user computing Wil O'Mullane
     
  • Address increases in compute allocation to users if needed Wil O'Mullane
     
  • Respond to Gregory's DMTN-202 Wil O'Mullane 
9:30 RSP policy issues: User data guarantees, backwards compatibility promises, hybrid model impacts etc.

User data guarantees

User data locations:

  • POSIX (home/project) filesystem
    • Will have backups, but no restrictions up to quota
  • Butler collection
    • How does user data move from one DR to another
      • Users should have to move it themselves
    • Can break Butler compatibility between DRs if necessary
  • User databases
    • Similar to Butler for DRs

Access to data is better-preserved with schema migration and backward-compatible new software.
Reproducing results is better-preserved with old Registry and software.

10:00 Terminology: What else should we change while changing the default branch?

"Master calibration" is a standard well-understood term.

Try to coordinate with other observatories in choosing a new term.

  • Frossie Economou will reach out to Dara Norman about a replacement term for "master calibration"  

Some "blacklist/whitelist" usage fixed in Qserv docs/comments.

Could have a hackathon for this?  Consider for next PCW.

10:15 Focus Friday (Poll)

Requests for "Focus Friday" exceptions by developers in AP have generally been addressed by "save for later" or other tools.
Should consider using scheduled messages every day to deal with timezones, not just Fridays.
But some people can/will read and respond even outside normal work hours.
Need to better understand culture issues in general; perhaps do a similar survey on a different issue for each meeting?

RHL: What will we do about the lack of documentation?
Ian: Scheduling work for people to add documentation.
Writing answers provided on Fridays directly into documentation helps.

Wil: everyone likes not having meetings; some relaxation of Slack rules might be considered; not an overwhelming push to change things

RHL: Future surveys should get better coverage from non-DM people.

10:30 Break

Moderator: Wil O'Mullane 

11:00


Premortem


Entries here 


11:30 Status part I
  • Leanne - Science
  • Cristián - IT/LHN
  • Frossie - SQuaRE/Sci Plat
  • KT - Arch
  • Fritz - DAX
  • Michelle - NCSA

La Serena -29.91, -71.24

Tucson 32.20, -110.96

Gibraltar/Malaga 36.14, -5.35

SF  37.76, -122.43

Oakland 37.80, -122.22

Urbana 40.11,-88.199

Princeton 40.34, -74.68

Seattle 47.65, -122.30

12:30

Break

Moderator: Robert Lupton Notetaker:

1:00


Status part II


1:45 Wrap up 

DMLT: T/W/Þ  2022-02-15 in Tucson; virtual  2022-06-14; 2022-10-18

Day 3, Thursday October 21 2021

Time | (Project) Topic | Coordinator | Pre-meeting notes | Running notes

Moderator:

Notetaker:

Wil O'Mullane 
09:00 Milestone Parade

https://docs.google.com/spreadsheets/d/1TUIUf84qHX5QfcCNWs27HGKlgHmCKcCs1IpBP5ODNDA/edit#gid=0


10:30 Close




Proposed Topics

Topic | Requested by | Time required (estimate) | Notes
Requirements on user batch | | 30 | DM Science will present requirements on user batch. dmtn-202.lsst.io
How can we finalise Data IDs? | Frossie Economou | 30 | One of the biggest obstacles we have to putting VO services into (Rubin data) production is the fact that we do not have an agreed format for Data IDs uniquely identifying our data products. This information exists in a dict, but we don't have a scheme for converting it to a string. Let's discuss the complications and come up with a plan.
Low hanging fruit milestones and how can we claim them | Frossie Economou | 60 | We have a bunch of milestones that can be closed with a bit of work that keeps getting put off or for lack of the formal testing being completed. Let's go through them and come up with a plan.
Consolidated DB or equivalent | Frossie Economou | 45 | We need a plan on how to get observation metadata somewhere where VO services can get them - this means probably the Consolidated DB (though, crazy idea, the Butler registry seems to know most of this stuff?). Right now nobody seems to own this and be working it, so we have to come up with an actionable plan. KTL: Isn't this DM-30853?
RSP policy issues | Frossie Economou | 30 | User data guarantees, backwards compatibility promises, hybrid model impacts etc.
Verification plan through end of construction | | 60 | Picking up what I should have presented at an earlier F2F meeting (but got ill)
Focus Friday | Ian Sullivan | 15 | I am getting more feedback from developers getting frustrated by some aspects of Focus Friday. Note that these concerns are mostly addressed by the open support channels and by instructing them on how to use Slack's "Schedule for later" feature.
Sizing | | 15 | What changes in data products or pipeline step complexity or memory usage are known or anticipated? What is the plan for DM-22082?
Terminology | | 15 | What else should we change while changing the default branch?


Attached Documents


Action Item Summary