Logistics

Date

  2020-09-15 – 2020-09-17

Location

This meeting will be held on Zoom:

For the meeting passcode, see #dm-camelot on Slack.

Participants

Agenda


Day 1: 2020-09-15

Time (Project) | Topic | Coordinator | Pre-meeting notes | Running notes

Moderator: John Swinbank

09:00 | Welcome | Wil O'Mullane
  • Introductory remarks
  • Review agenda and code of conduct
  • DMTN-153 reading for DMTN comments.
  • Robert Gruendl reports that OCPS development is ongoing.
    • Robert G. will report to Tim, K-T and Robert L. on the next set of development epics.
    • A design document will be forthcoming as part of the development process.
  • Frossie Economou will follow up on DM-15198/DMTN-139 with Gregory Dubois-Felsmann.
09:15 | Project news and updates
  • It's been three weeks since our last DMLT call; what's been happening?
09:40 | John Swinbank transition

  • Review plans for John Swinbank leaving the country / the project over the next few weeks.
  • At (home) desk in Seattle through next week (until 2020-09-25); expect to continue regular work.
  • Flying to NL 2020-09-30. At this point, Yusra AlSayyad (overall manager, focus on DRP) and Ian Sullivan (deputy, focus on AP) assume full management responsibilities for the Science Pipelines team.
  • In quarantine in NL through 2020-10-14; will continue tidying up loose ends & writing documentation, and will be available for questions, meetings, etc, on demand.
  • 2020-10-15 onwards: stepping back from regular work on Rubin. Will continue to be available by mail (and Slack, probably) for questions, discussions, etc, on request.
09:45 | Community support
  • Melissa Graham has proposed a model for technical support during construction, which spawned some discussion on RFC-703.
  • She has subsequently been developing these ideas into DMTN-155.
  • Do we have DMLT sign-off on these ideas? In particular, do they provide an adequate level of support to the community, without placing an excessive burden on the construction team?
  • How will these plans be communicated to the wider community?
  • The scope of this document is “science-level” questions, not technical support at the level of lost passwords etc.
    • The IT helpdesk will be provided by NOIRLab, SLAC, NCSA, etc.
  • (Leanne summarizes the DMTN; not transcribed here.)
  • Community engagement in operations is described in RTN-006.
  • How do we determine whether a given issue is an IT problem or a scientific issue?
    • Often, it will be “obvious”.
    • Where necessary, CET will provide triage through the Community Forum.
  • Request for more boilerplate: when a question has already been answered in Slack, but we need to take it to Community.
    • In general, people shouldn't answer on Slack, so ideally this doesn't happen.
    • But when it does, we should aim to copy & paste.
  • There is concern that running support through Community will increase the load on the DM team relative to Slack.
    • We expect it to continue to evolve based on experience from construction and DP0 (1, 2).
    • And the SST, CET, etc are very much open to feedback and suggestions.
  • This is not just an effort to prepare for ops, but also an effort to relieve growing pressure on the construction team.
  • There are requests for the CET to take a larger role sooner, but this is limited by its scope and funding source, which are both tied to operations.
  • We do have to ensure that we acknowledge contributors, especially external ones, and not simply try to redirect them to Community.
  • It is important that community support be tied to user-facing documentation.
    • The SST/CET should provide feedback to the development teams about where documentation needs to be improved.
  • Further discussion should be redirected to #dm-sst on Slack.
  • We expect the CET to play a coordination role in documentation, but will require inputs from the Pipelines and Algorithms teams.
    • Detailed breakdown of responsibilities is TBD.
    • CET is already working on this for DP0 in conjunction with DESC.
  • Leanne Guy — update DMTN-155 to reflect how to move answers which have already been given on Slack onto Community.
    DMTN-155 updated - please review 
10:30 | Break

Moderator: Leanne Guy

11:00 | Improvements to the build and release system
  • The Architecture team is working towards a series of improvements in the build and release system.
  • What is planned?
  • What is the implementation schedule?
  • KTL Build+Package DMLT F2F 2020-09-15.pdf
  • John asks about LSST-dev, which can probably be fixed using lsstinstall. Are there other dependencies on newinstall? Seems not (most use containers which embed newinstall; changing the container build fixes all of these).
  • RHL asks about Telescope and Site - they build on top of our containers, so should be OK.
  • GPDF asks about long-term stability/availability of conda-forge. KT thinks it has a multi-year horizon; conda has an interesting history, and its future is partially supported by a commercial company. Conda-forge is much more community-based and has a big community.
  • John - who is the product owner for the build system? Unfortunately it's KT both owning and managing. Does KT understand who all the stakeholders are? KT is confident he knows the people with Jenkins jobs and the user base for lsstsw and newinstall - will go to community in any case.
12:00 | DMTN-148
  • This is the long-discussed calibration products policy document.
  • Can we sign off on it by this point? If not, what's needed to get us to that point?

Move to have DMLT accept this document. It's a good overview of the situation/plan.

KT shows a diagram and asks: is this the flow?

  • Kian-Tat Lim attach diagram to this Confluence page

How are validity ranges stored? Tim - uses the directory structure and filename. QE curves come from Camera directly and are imported. Jim - there is a big wall in Gen3 between certified calibrations and those not yet certified. Export and import deserve the ??? - we need more research on that.

RHL - not sure squash is in there for e.g. images. CamGeom deprecation was slipped into the document: Jim and Tim want to do this, but it was a surprise to see it in here. Otherwise happy with the document.

John - if there are technical comments it does not need all DMLT but then we are back to the outstanding action.

Colin - found it difficult to get a feel for what it's describing - KT's diagram is a huge help. This may be partially why DMLT have not commented in detail. John agrees on the content and gave similar feedback to Chris, but hearing nothing from DMLT was taken as all OK, not befuddlement. If the latter, we should include the diagram and update.

Tim - defects are easy to handle; perhaps it's worth having a worked example. Jim asks if KT's diagram works for defects. Tim says yes, but there may be other approaches.

Jim - the technote is good for the products which are fairly automatic (human yes/no), not the ones merged by a human. John - we need to write down that we do not know, when that is the case. This is somewhat the case in this doc.

GPDF - are crosstalk corrections handled? Tim - yes. In a given CDB3 instance, when you replace a calibration, is it replaced (is it bi-temporal)? Jim - it's not, but the idea would be to have a new collection (new name), not to actually replace the old one.

RHL - all the special cases for detectors are not covered - it may not be as uniform and nice as this makes it out to be. It could be messier when we get to it, so hesitate to sign off. Back to CameraGeom ...

  • John Swinbank Arrange focused brainstorming meeting with RHL, Tim, Chris, Jim, Yusra, John to get DMTN-148 further updated. Should at least list all calibration cases even if not solved.

Jim - how we access calibrations is different to how they are written - may need Robert to propose an alternate design. There is a feasibility issue.

KT - best way forward?

  • Christopher Waters Kian-Tat Lim Modify DMTN-148 with more diagrams (from KT) and explicit statements about which products it applies to, and which it does not apply to (and when).

From zoom:

John - where were we commenting on this document?

From John Daniel Swinbank to Everyone: (8:59 p.m.)
https://github.com/lsst-dm/dmtn-148/pull/3
From Gregory Dubois-Felsmann to Everyone: (8:59 p.m.)

Is what Jim said a couple of minutes ago about what happens in BG3 when a calibration is certified going to be included in DMTN-148?

From Tim J to Everyone: (9:13 p.m.)

I think one of the things is that pipelines just need to be configured to use specific dataset types — that’s the optimal approach for a pipeline. Having every pipeline instead require a composite cameraGeom is overkill

From Tim J to Everyone: (9:14 p.m.)

but from a commissioning perspective it’s clearly easier for Robert to have access to everything in one blob

From Robert Lupton to Everyone: (9:17 p.m.)

I'm worried about notebooks, not pipelines. It's possible that pulling out a set of n parallel data products with the same dataId is OK, but it pushes the book-keeping onto the code. That's not too bad until the code starts by updating some of the values (e.g. the gain). Then the code becomes much more complicated, but if we just allowed setting values on the camera and doing a "put" makes the user's job much simpler. So it's a tradeoff.
So that notebook may become a calib-products "pipeline". But a weird one


12:30 | Break

Moderator: Wil O'Mullane

12:45 | Security trade-offs / RFC-723
  • Background:
    • NCSA has instituted a 2FA requirement for the new lsst-login servers.
      • Either SSH with password + DUO
      • Or Kerberos + DUO
    • After authentication, a control connection can be used to avoid further authentication.
    • Kerberos renewable tickets can be used for 25 hours / one week without further renewal.
      • But DUO is still required.
    • At Princeton, it's possible to use DUO + an SSH keypair.
    • Concerns from NCSA that SSH keys stored without passphrases are less secure.
    • Use of DUO at NCSA is required by UIUC.
  • Question before the DMLT: how much should we care?
  • We presume that Wil has the authority to define policy and accept risks based on such a tradeoff.
    • It's not clear that this could overrule UIUC policy, though.
  • We assume there is a fair bit of discretion on behalf of NCSA security staff about how that policy is implemented.
  • We could make functional requests of NCSA (“we want persistent connections”), or implementation requests (“we want SSH keys”).
  • Unknown User (mbutler) — understand the parameter space for getting a “long term lease” on an SSH connection to NCSA, and discuss with Wil O'Mullane what wiggle room we have.  
13:15 | Generation 3 middleware plans and acceptance criteria
  • Present the criteria which have been developed for Gen3 achieving “feature parity” with Gen2, the associated test plan, and the associated timelines
  • Roadmap to Deprecation of Gen2 Butler.
  • Aiming for “Gen 3 ready for general use” by November 1st.
    • Do not anticipate formal acceptance testing on this date; handover will be based on completed Jira tickets, rather than a test campaign.
    • However, functionality is regularly tested in CI.
  • First priority is schema changes; aim to resolve them quickly, since they are maximally disruptive (may require re-ingest).
  • Following this milestone, we should discourage use of Gen2 whenever possible.
  • This milestone will rely on a shared database.
  • However, it is expected that the system is usable at this stage; some things might still be easier in Gen2, but not many.
  • QuantumGraph generation time is being addressed before hitting this milestone.
  • Note that “feature parity” here is explicitly for middleware; Science Pipelines features will be available in Gen3 later, but this is currently a high priority.
    • But outputs from Gen2 pipelines can be converted to Gen3 for analysis.
  • Leanne Guy — agree Gen3 acceptance tests for November 1.  
  • Yusra AlSayyad  — provide a timeline for complete pipeline conversion to Gen3.  
14:30 | Close

Day 2: 2020-09-16

No sessions!

Day 3: 2020-09-17

Moderator: K-T Lim
09:00 | Milestones

In discussion at the JDR, a couple of issues emerged surrounding DM's milestones:

  • The review recommended that our milestone tracking be more automated / streamlined;
  • Existing milestones are poorly defined (to the extent that the responsible T/CAMs don't know what they mean).

How can we address these?

Recording is on by consent of all for internal use.

Frossie says she did not hear it quite the same (for the first point of slide 2) - automation would be good, but we need a coherent story. It would be great to have automation for Level 3 milestones, but we are unlikely to get it.

From chat: the problem is that the milestones are not written in quantifiable ways.

Question about lag - yes, updates lag by a month.

How do you know which milestones are depended on by others? In DMTN-158, which shows predecessors and successors.

Could add a line for predecessors and successors; Michelle/Yusra would like that.

  • John Swinbank add predecessor/successor line to milestones in DMTN-158
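
As a minimal sketch of the predecessor/successor line discussed above: if each milestone records only its predecessors, the successors can be derived by inverting that mapping. The milestone IDs and the dictionary layout here are hypothetical, chosen for illustration; they are not taken from DMTN-158.

```python
# Hypothetical milestone IDs and predecessor lists, for illustration only.
predecessors = {
    "M-A": [],
    "M-B": ["M-A"],
    "M-C": ["M-A", "M-B"],
}


def invert(predecessors):
    """Derive each milestone's successors from its recorded predecessors."""
    # Start with an empty successor list for every known milestone.
    successors = {milestone: [] for milestone in predecessors}
    for milestone, preds in predecessors.items():
        for pred in preds:
            # setdefault covers predecessors that have no entry of their own.
            successors.setdefault(pred, []).append(milestone)
    return {milestone: sorted(succ) for milestone, succ in successors.items()}


successors = invert(predecessors)

# Print one "predecessors / successors" line per milestone.
for milestone in sorted(predecessors):
    print(
        f"{milestone}: predecessors={predecessors[milestone]} "
        f"successors={successors[milestone]}"
    )
```

Deriving one direction from the other means only the predecessors need to be maintained by hand, so the two lines cannot drift out of sync.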
09:30 | Team status
  • Each group please provide (~10 minutes total):
    • A brief retrospective on what's happened since our last meeting.
    • Plans for the next few months.
  • Following past griping, order is now determined by the magic of Python...
[ins] In [2]: import random
[ins] In [3]: random.shuffle(teams)
[ins] In [4]: teams
Out[4]:
['SQuaRE (Frossie Economou)',
'DAX (Fritz Mueller)',
'Data Facility (Michelle Butler)',
'Alert Production (John Swinbank)',
'DM Science (Leanne Guy)',
'Data Release Production (Yusra AlSayyad)',
'Architecture (Kian-Tat Lim)']

(Sorry Frossie!)

  • SQuaRE:
    • Commercial EFD replicator is too expensive from Confluent (even at discount).
    • Adding the visit number to the EFD will be handled as part of OWL, but is not an explicit goal of this demicycle.
  • DAX:
    • Slides 
    • Wil suggests using “cloud bucks” at NCSA for APDB scale testing; may also be able to go through IDF. Fritz will follow up with Michelle.
      • Also possible to push this as a POC on a cloud provider; action on Fritz to write up a statement of work.
    • Note that we need to be careful to distinguish the colour of money: ops money pays for DP0, not construction money.
    • Everybody is encouraged to think about the impact of effort being expended on ops activities on construction milestones.
  • LDF:
  • Alerts:
    • Slides 
    • KT - AP Gen3 assumes all raws etc. are in the butler? Yes.
    • KT - alert packet cutout sizes are limited? Yes - more work to be done.
    • Tim - WCS: is it AP or DRP? Formally it's AP. Dave Berry contract to modify AST to export YAML ASDF format, giving a WCS understandable by Astropy. This means any Astropy user can download a Calexp and use our WCS.
    • GPDF - still outputting an approximate WCS in the FITS standard as well as the YAML? Yes, no change - ASDF has a FITS translation format; will try to use their scheme.
    • Michelle - running any AP pipelines at NCSA? Should NCSA start running them? That would be great; trying to move more to the DRP mode, but there have been a lot of things holding the team back. In next few months... Ian Eric ..
  • Fritz Mueller — write a SOW for APDB POC on a cloud provider. 
10:30 | Break
Moderator: John Swinbank
11:00 | Team status
  • Continued from above.
  • DM Science
    • Slides 
    • Is there a plan to update estimated object counts?
      • Yes, although this primarily comes through the PST. No progress recently, but should look at this on a 6 month timescale.
  • DRP:
    • Seeing similar burnout issues to those reported by other teams.
    • Note that it's hard for Tim Jenness as middleware manager to keep track of what DRP (and other) team members are doing in their non-middleware time (including personal issues, etc); consider having Tim attend T/CAM meetings, or sprint planning with DRP (and other) team members.
  • Architecture:
    • Slides
    • Do we have a clear understanding of who is responsible for solving “the TAP schema problem”? Getting data ingested into the database visible in the TAP service. 
      • Architecture has provided some tooling for this.
      • But linking those tools and providing appropriate metadata is the responsibility of Pipelines and DAX teams.
      • DAX will provide a Felis description of catalog data for ingestion.
      • Requires further work in FY21.
      • Wil will look further into how the responsibilities break down here; this may be an update to DMTN-155 (or it may be elsewhere; operational procedures?).
11:30 | Quiet Day
  • Following our experimental quiet day
    • Should we repeat?
    • Weekly? Monthly?
    • Should it be "quiet, no messages" or "quiet, no meetings"?
    • Should it apply to everyone in DM? A "maker" subset but not a "manager" subset?
    • Should we try to spread this beyond DM to other parts of the organization?
  • Feedback broadly positive (from “it was ok” to “it was absolutely fantastic”).
    • Worry that it might precipitate “large walls” of meetings; might be better with a “less draconian” approach allowing some requests for help.
    • Knowing when it would be well in advance would help people take advantage.
    • Be clear about expectations: days off and focused work days should not be the same thing.
    • Juggling commitments to other groups/projects who were not doing quiet days.
    • Appreciate not having to read the Slack backlog!
    • Introducing “variety” is helpful.
    • What does the perceived pressure to keep up tell us about the overall DM(LT) “bus factor”?
    • Forcing people to take a vacation day is not appropriate.
    • The same quietness schedule may not be appropriate to everybody; maybe different people should have quiet days on different cadences.
    • It should be understood that you can take a day off when you need to without inconveniencing others; a quiet day shouldn't be necessary for this.
    • Broad support for the idea that these days should not be associated with vacation or time off.
    • At least to the non-leadership types, avoiding the Slack deluge may be more valuable than avoiding meetings.
  • Cadence:
    • No meeting Friday is pretty ubiquitous.
    • General agreement that a weekly quiet Friday is appropriate.
  • Tooling:
    • It may be appropriate to send async (Jira/GitHub/e-mail) messages, but with no expectation of a reply before the next working day.
    • Slack is always a distraction, so not appropriate.
  • Should it apply to everyone in DM?
    • Yes.
  • Should it spread beyond DM?
    • That is happening naturally; NOIRLab and Ops are already moving in this direction.
  • Need to advertise expectations to other parts of the project.
  • Should be codified in the Developer Guide.
  • There should be a trial period, revisited in a couple of months; and we should solicit feedback and revisit this discussion.
  • Wil O'Mullane — update the Developer Guide to reflect the plans for a Quiet Day.  
12:00 | Wrap up
  • Screengrab: DMLT photo
  • Review actions from this meeting.
  • Upcoming meetings:
    • Virtual, 2020-11-17/19.
    • Tucson, 2021-02-22/25.
      • MCR booked.
  • Please remember to upload any slides you presented at this meeting to Confluence!
  • February meeting will be virtual.
  • Wil O'Mullane — send out Doodle polls for calendar 2021 DMLT meetings.  


Attached Documents

Action Item Summary

Pre-Meeting Planning

Topic | Requested by | Time required (estimate) | Notes
Build system status | 30 minutes | In May 2020 we were unable to make a 19.0.1 patch release because of incompatible changes to the build and release system since the 19.0.0 release. The Architecture team were tasked with updating and simplifying the build and release system to ensure that this couldn't happen again (i.e., whatever changes are made to the underlying infrastructure, we should always – within reason – be able to reproduce and update old releases). This session is an opportunity to review the plans that were made and the progress towards implementing them.
Community support | 30 minutes

As we move closer to operations, members of both Science Collaborations and the wider scientific community are taking an increasing interest in using our Science Pipelines and other software. We need to be able to provide them with technical support, without imposing an unreasonable burden on our on-project staff. In particular, in May of this year, specific concerns were noted about members of the community using Slack channels which were originally intended for technical discussion on the DM system to ask for technical support.

Providing a coherent approach to support is challenging, given the wide range of interests and skills in the community, limited on-project resources, and the need to provide a system which both supports the construction project now and which fully transitions into the System Performance department's Community Engagement team in the future.

How much progress have we made since May? Do we now have a coherent message on what support we are providing, and through which channels? Have we clearly communicated that message to the leadership of the various science collaborations?

Melissa Graham: I (Leanne) might call on you to join this session.

L3 milestones | 45 minutes

In discussion at the JDR, a couple of issues emerged with L3 milestones:

  • The review recommended that our milestone tracking be more automated / streamlined;
  • Existing milestones are poorly defined (to the extent that the responsible T/CAMs don't know what they mean).

How can we address these?

G3 middleware acceptance | 1 hour | Present the criteria which have been developed for Gen3 achieving “feature parity” with Gen2, the associated test plan, and the associated timelines.
Jenkins futures | 30 minutes

The AP team would like to be able to run ap_verify in Jenkins against ticket branches in other packages.

Kian-Tat Lim tells us this would involve a substantial retooling of Jenkins, but that some work in this direction is already in place.

It'd be useful to understand what changes are planned.

(This may be the same as “build system status” above; Kian-Tat Lim and/or Unknown User (gcomoretto) might wish to comment.)

DMTN-148 | 30 minutes | We need to work out a way of finally signing off on DMTN-148. Development is assuming it is accepted but it's still technically in limbo.