...

Day 1, Tuesday 23 February 2021

Time (Project) | Topic | Coordinator | Pre-meeting notes | Running notes

Moderator: Yusra AlSayyad

Notetaker: Simon Krughoff

09:00 | Welcome | Wil O'Mullane
  • Introductory remarks
  • Review agenda and code of conduct
Slides for Quarterly https://docs.google.com/presentation/d/1g6GrtisnIqMvY75t1C4Epx_n2JZCUhOF0zS9bGoSyok/edit#slide=id.gbe9b77a59e_1_0
9:15 | Project news and updates | Wil O'Mullane
  • FY2
  • Subject to change: End of construction is currently 08/2023.  Beginning of Ops is 10/2023.  The difference is construction schedule reserve
  • Updated date for ComCam on Telescope – July-ish.  Subject to change.  Uncertain for on sky, but hopefully end of year.  New milestones upcoming
9:45 | Ops Rehearsal | Wil O'Mullane | Moved from 9am PT Wed
  • Quick check-in on OR#2 for commissioning: we put this off for Gen3, the ComCam move to the summit, etc.

    • Now we have Gen3 repos.
    • ComCam is up on Floor 3.
    • Should we do something to ensure we are ready for the integration on Floor 3?
      • EFD, ComCam, hexapod, cable wrap, ...
      • Or a repeat of last time but with Gen3?
  • We will do another Ops rehearsal. It will be Gen3 focused. Schedule TBD. ComCam is now on the summit, so we have options. RHL: it will be good if we can make it work for us to learn things we need to learn.

9:30 | Provenance WG | Moved from 2pm PT

  • RHL: What do you mean by logs? FE: Everything. Not sure how to rendezvous camera logs with pipeline logs, for example.
  • GPDF: These are process logs, not science log books? FE: Correct.
  • KSK: We are not doing hardware provenance.
  • KTL: What about provenance of experimental hardware? GPDF: We will be linking to the maintenance management system (MMS). That should tell you what hardware is where and when, but we are assuming that system will be fully functional without input from this working group.
  • KTL: It's possible that the YAML description in obs_lsst could get out of sync with the on-summit reality if these are not intentionally linked up.
  • TJ: Detector serial numbers are directly encoded in the image headers, so there is no need to go back to the MMS for that. RHL: But raft serials are not.
  • RHL: There are systems that may be dropped that the WG could raise as important sources of provenance that should go forward.
  • Gregory Dubois-Felsmann: Will help with wordsmithing the report to call out the need to preserve any camera metadata/telemetry that is not ending up in the EFD. Due by end of Provenance WG tenure.
  • FM: Sizing model? KTL: If there is a recommendation to get rid of heavy footprints, it will help, since they are currently included in the sizing model.
  • FE: Source IDs are a concern. See the report with recommendations.
  • GPDF: Do we no longer need to worry about compressed PVIs? KTL: We still need to worry; we are assuming a compression factor for those.
  • RHL: When is the report due out? FE: By the next DMLT F2F. RHL: I'd like to see a draft.
  • Frossie Economou: Will circulate the Provenance WG document when the draft is ready.
10:00 | Container builds | Moved from 11:45am PT

Looking for efficiencies in:

  • Container builds across RSP, Prompt Processing, DRP, Telescope & Site
  • Non-service, non-stack packaging
  • KTL: should we build T&S software with the stack? With the same technology, or at the same time?
    • Merlin would like the entangled part released daily.
    • It would be good to notice if something in T&S breaks when we update the stack; the problem is finding out later.
  • Tim notes in chat that PanDA is now using lsst_distrib containers.
  • RHL: we have painted ourselves into a corner. On the mountain we want to be able to do developer-type things; we need to get together and decide how to achieve that. He feels the Docker approach is more aimed at stable releases.
    • Not everyone agreed: containers do not necessarily make things slower.
    • Conway's law: Science Platform, T&S software, etc. are currently all under one lead, but not in Ops.
    • Pick a day in March to try to settle some of this and seed some technotes.
  • JB: we also need to look at releases and patches rather than using the bleeding edge.
    • This has been suggested; some people recoil from it.
    • The only way we currently test T&S and the stack together is on the summit; we need some tests (a notebook can be driven by Nublado). See the smoke-test sketch after this list.
      • KTL: the intent was to use the NCSA test stand for this, but AuxTel was not running there.
      • RHL says it may be the Tucson test stand.
  • Colin: how rigid is the T&S versioning situation?
    • Fundamental T&S issue with XML: an interface change requires everything to be rebuilt.
    • But it may be the way we are packaging this with the stack that is the problem.
    • TimJ points out Nublado is used for observing, which was not supposed to happen; Nublado was meant to be insulated from this.
    • Go back and create cleaner interfaces.
  • KTL: T&S does not have a build engineer in Ops; do we actually have anyone to produce any of this output?
  • Interfaces or monolith?
    • Some background given on the lack of a control system and why notebooks came into being.
  • GPDF: could use a cookie between two containers, but this is the discussion we should have in the meeting.
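
A minimal sketch of the kind of automated check discussed above: a smoke test that could run inside a freshly built stack + T&S container and fail fast if an update breaks imports, instead of the breakage being discovered later on the summit. The module list is illustrative only (the T&S package name is an assumption); the real list would come from the T&S and DM teams.

    # Smoke-test sketch: verify that DM and T&S packages still import in a
    # newly built container. Module names below are placeholders.
    import importlib
    import sys

    MODULES_TO_CHECK = [
        "lsst.daf.butler",   # DM middleware
        "lsst.afw.image",    # DM science pipelines
        "lsst.ts.salobj",    # T&S control-system library (assumed name)
    ]

    def smoke_test(modules):
        """Import each module, collecting failures instead of crashing."""
        failures = {}
        for name in modules:
            try:
                importlib.import_module(name)
            except Exception as exc:  # record any import-time error
                failures[name] = repr(exc)
        return failures

    if __name__ == "__main__":
        failed = smoke_test(MODULES_TO_CHECK)
        for name, err in failed.items():
            print(f"FAIL {name}: {err}")
        sys.exit(1 if failed else 0)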

Frossie Economou: Doodle poll for a day in March for a discussion on container builds with a coordinated DM and T&S group.

10:30 | Break

Moderator: Simon Krughoff

Notetaker: Wil O'Mullane

11:00 | Campaign Management | Kian-Tat Lim

DMTN-181 draft note on campaigns

  • FM: is this agnostic to which things are co-located on a site (slide 5)?
    • Assumes a single site; workflows should be single-site, but campaigns can span sites.
  • Frossie: is a campaign offline or a data drip?
    • It would include a data drip.
  • GPDF: slide 4: pipeline is not mentioned in the definitions; is a workflow a pipeline execution?
    • Some discussion on that; TimJ clarified (sorry, the notetaker missed it).
  • GPDF: Marking intermediate products bad and keeping them out of downstream processing was a critical feature of the BaBar workflow system.
    • JB: It is important; data IDs may not be enough then.
      • Exclusions may change for a new campaign.
      • It may be better to record it more explicitly with the campaign.
    • TJ: not automated enough; there is no plan for using generated metrics downstream. The provenance records which inputs were used.
      • KTL: some up-front setup from humans; pipelines should take care of some of the rest.
    • FE: is there an interface that could be used by computers as well as humans? Do humans have to be involved in nightly processing?
      • Yes and no: some human needs to configure the nightly processing, but no one should have to push a button.
      • Implementation is important; we do not want to encourage very manual intervention.
    • RHL: exclusions from looking at a flat are no different from QA systems flagging things; it's a continuum. We will find things in the middle and say we should have flagged this. Can we go back and figure out which products are affected?
      • TJ: we absolutely need a way to know that if flat X is suddenly determined to be bad, we can find all products that used X and redo them.
      • If you change the campaign it's a new one; you don't have to run all of it again, of course (if Gen3 allows you to use partial workflows).
      • RHL wants to continue the processing with some new exclusion; not necessarily a new campaign. It may be semantics.
      • TJ: regarding provenance, you could imagine that we could rebuild a graph from the provenance of that product and just remake that product. JB agrees.
    • TJ: does this cover which data were processed?
      • Not this system.
      • Holes in processing would be important, though.
      • KTL thinks perhaps this would be an add-on "linter".
      • GPDF: if a campaign is immutable and runs for eight months, we need something which tracks changes.
        • Could include campaigns.
      • JB: the levels now defined work if we push them down to the level of a workflow.
        • I think I'd like semantic changing of versions to just be one case of semantic versioning of software releases, with all config changes recognized as essentially the same as a change to the software.
        • Iterations add workflows in the middle of pre-existing workflows.
        • The progress tool needs to consider how the chunking was done.
        • RHL: this implies semantic versioning.
    • WOM: not all campaigns are equal; some we restart and interact with, and some we must always start from the beginning.
    • FE: we do not want two ways to get the processing done (manual and automatic); if we do not put pressure on automation it will not happen. We would not want to be manually processing data after 10 years.
    • KTL: trying to find clear places where automation can be added. Would like more pressure to get automated.
      • There is not yet an assignment of who will build this, so we cannot promise the scope.
      • So we want a framework that could be used.
      • FE: when we have made things simple before it has bitten us; we should define what we need and get a high-level developer to implement it, not assume that just anyone can pick it up.
    • RHL: does not understand the worry about (lack of) automation; we will do small QA runs and we will inject them, and KT has defined mechanisms that allow that.
      • Frossie: once you provide tooling for humans it takes effort, and a better system could have been built to automate it.
        • TJ: the problem: say "good seeing"; do we want the graph builder to understand that, or is a user making a query to get the list of raws? What about 10 million input files? Calculating metrics which are used in downstream tasks: if it's not wanted we do not need to build it.
        • JB had assumed we were building this.
        • KT: an interface for a list of exclusions is only an option, not the main way; there should be a selector function built in.
          • RHL offers an HSC example in the deep fields. TJ does not think anyone is AGAINST exclusion lists.
          • JB: we already have a system for this which does not use an explicit list.
        • YA (in chat) pointed out that the metrics may be selected on if stored as datasets; they do not have to be in the registry.
    • GPDF: i.e., if the name of the person running the pipeline changes, the configs would still evaluate as equal (true), but if the size of the aperture for aperture correction changes they would evaluate as false (see the sketch after this list).
    • TJ is in favor of exclusion lists; there are multiple levels: some data are taken as part of observatory operations, everything else should go through single-frame processing, and some you will never want in a coadd.
      • It's more complex; ideally it would always come from the metrics.
      • Different selector for each coadd (YA).
    • CS: it is easier to think about how it all interacts when we have a straw-man system design, in particular the mapping of a campaign to the workflows below it.
      • A list of BPS commands put in a form.
      • How you come up with that list is another intellectual/science problem.
      • But that is exactly the part that needs to be worked out.
      • RHL suggests doing this concretely for HSC.
  • TODO
    • Fleshing out the external tools (slide 7) would be useful.
    • Tooling to generate the BPS lists and exclusion lists.
    • Provenance from the previous campaign to help come up with a new one.
    • A concrete example with HSC.
    • Overarching architecture.
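
A minimal sketch of the "semantic" config comparison GPDF describes above, under the assumption that campaign configurations can be treated as flat key/value mappings; the field names are hypothetical and are not the pipelines' actual config keys.

    # Sketch: compare two campaign configurations, ignoring fields that are
    # operational bookkeeping and flagging fields that change science results.
    # Field names are placeholders, not real pipeline config keys.
    OPERATIONAL_FIELDS = {"operator", "submission_host", "start_time"}

    def configs_equivalent(old: dict, new: dict) -> bool:
        """Return True if the configs differ only in operational fields."""
        keys = set(old) | set(new)
        return all(old.get(k) == new.get(k) for k in keys - OPERATIONAL_FIELDS)

    # A changed operator name is still "the same" campaign configuration,
    # but a changed aperture-correction radius is not.
    base = {"operator": "alice", "aperture_correction_radius": 12.0}
    assert configs_equivalent(base, {**base, "operator": "bob"})
    assert not configs_equivalent(base, {**base, "aperture_correction_radius": 17.0})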





12:30 | Break

Moderator: Frossie Economou

Notetaker: Kian-Tat Lim
13:00 | Alert Distribution & Brokering
  • Status report on the SAC review of the Broker proposals 
  • Discussion of a proposal for a "hybrid" alert distribution system (dmtn-165.lsst.io); implications for the alert filtering service and alert DB
Broker selection
  • Got 9 proposals in Dec out of 15 letters of intent
  • All wanted full stream
  • Most likely will want at least 7 rather than 5
  • Do MOUs come from Ops project?
  • SLAC might have more bandwidth outbound to support more brokers
  • Could also relax latency, cut contents, or provide streams in the cloud with user-pays
  • Can we support user-pays at SLAC? Difficult, not metered
  • Make use of "smart networking fabric" across borders? Possibly
  • Are there support costs or other issues that might be hidden?
  • Should be discussed on the Ops side of the fence
  • 10 Gbit baseline might be per-node, so achieving larger could be reasonable
  • Conclusion: Don't forestall any SAC moves to expand the list of brokers

Hybrid alert concept
  • Unlikely to be able to build a usable end-user Alert Filtering Service by the end of construction
  • Previous options: descope AFS and leave to community brokers or outsource
  • Instead, use hybrid alerts: small notification packets with separate large downloads (see the sketch after this list)
  • Small = ~200 bytes/alert; expect minimal overhead (not VOTable)
  • Full alert backing store can be the same as the Alert Database
  • Alert Database is archive of all the alerts independent of filtering
  • Can set rate limits per user
  • Direct access to notification stream and full alerts would be restricted to data rights holders
  • Advantages:
    • More users
    • Bring in outside data
    • Filters in any language/system
    • No monitoring of performance/security
    • Rate limit can be user-managed
    • No on-project processing
    • Don't need to handle user filters
  • To ensure equity/access, need bootstrapping to get people running easily
  • Couldn't brokers do this? Some have mentioned it but none do now
    • Project-provided gives perception of stability
  • Full-stream brokers might also use it
    • Extra latency is probably not large (but latency to insert into Alert Database might be a problem)
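
A minimal sketch of the hybrid-alert idea above, with hypothetical field names and URL layout (the real notification schema and the signed-URL mechanics are not defined here): a small JSON notification carrying just enough to decide whether to download, plus a helper that fetches the full packet from the backing store.

    # Sketch of a hybrid alert: a small notification plus an on-demand fetch
    # of the full packet. Field names and the URL layout are hypothetical.
    import json
    import urllib.request

    notification = {
        "alertId": 123456789,
        "diaSourceId": 987654321,
        "ra": 150.1123,          # degrees
        "dec": -2.3456,          # degrees
        "band": "r",
        "magpsf": 21.3,
        "midpointTai": 60321.1234,
    }
    print(len(json.dumps(notification).encode()), "bytes")  # of order 10^2 bytes

    def fetch_full_alert(alert_id: int, base_url: str) -> bytes:
        """Download the full alert (e.g. Avro with cutouts) from the backing
        store; in practice this would be an authenticated or signed URL."""
        with urllib.request.urlopen(f"{base_url}/alerts/{alert_id}") as resp:
            return resp.read()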
Wil: Initial thought: we have community brokers, don't need another filtering service, don't have effort to build it, so descope everything
Don't try to do hybrid alerts? If A&A is needed, complexity goes up; could possibly be farmed out to others (Antares?)
Discussion in chat about VOEvent serialization (there is one in JSON) and transformability into web pages (like XSLT for XML-to-XHTML)
May still need internal filtering of alerts, but that would likely be before publication to stream and database
Frossie:
  • Could leverage RSP interfaces for A&A
  • Possibly leverage Butler Registry server (signed URLs)
  • Could write a template for boilerplate of subscribing to notification stream
  • Not clear that there is a lot more code needed
  • Maybe spend a day with RSP team to determine how much
Zeljko: Why is code running on RSP OK and in AFS a problem? Running in independent containers works for sandboxing but is less efficient
Colin: This could be better than many of the actual broker proposals if presented as one
Gregory: What are the computational requirements for processing the notification stream? Doesn't seem huge, but need to calculate
Possible EPO synergy

  • Eric Bellm, Frossie Economou: Discuss how RSP interfaces can be used to enable the hybrid alert model and determine how much extra coding is needed.





14:00 | Close

Day 2, Wednesday 24 February 2021

Moderator: Wil O'Mullane

Quick check session. Notetaker: Ian Sullivan





9:00 | Codifying Slack etiquette
  • Thread usage
  • @ on every message ("DESC-style"?)
  • @channel usage
  • Do we need to write things down?  If so, is the Dev Guide or Community or somewhere else (DMTN?) the best place?
  • Should we be prescriptive or suggestive?
    • Should document expected behavior to help new users
  • Could include the expected culture in the name of the channel
  • Encourages use of text snippets instead of massive code blocks
  • GPDF: It can help if the original poster solicits replies to be in a thread
  • FE: Impossible to mandate culture
    • the dev guide is really useful in that it also lays out the expected culture
      • Team leads point new users to it during onboarding
      • Problem is that we are having people join (at ~20%) without onboarding
    • Threading is controversial, people are afraid of missing things
      • YA: Uses a thread spool emoji to encourage threading
      • SK: Threads can also be to focus the conversation between a couple people, without 50 people chiming in
        • want it in the channel, to still keep the conversation public
    • The support channel is special, FE makes sure to re-read everything there every week to make sure nothing was missed
    • IS: problem has often been non-project people joining DM channels and unwittingly breaking cultural norms
      • JB: the people who follow the dev guide aren't a problem, it's people who join from outside that are not likely to ever look at the dev guide. 
      • JB: A message people receive when they first join would be more helpful
      • WOM: A welcome message when you first join would work, but that only comes when you first join Slack
    • WOM: the DEI discussion brought up the Tavern in an unfavorable light; we could make clear that ...
    • FE: Putting the standards in the dev guide allows us to own it
    • FE: LSST slack is essentially now a US astronomy Slack
      • We are outnumbered
      • Worried this will become a big problem for the support channels
    • FE: Could consider making Rubin-Ops only slack
    • WOM: We should create a dev guide page documenting Slack culture.
      • New users can be sent a link to that page
  • MG: reference DMTN-155 and include Melissa Graham in drafting the text
  • FE: If we have to rename channels (such as support channels), those should go through RFC


  • Frossie Economou: Write an RFC for renaming channels to make their support nature clear.
  • Kian-Tat Lim: Write a Slack user guide for the dev guide.
9:20 | QA plots/site

Slides

Leanne and Colin wish to have publicly accessible QA plots (from pipe_analysis, etc.).

There is a Docker image for the site:

  • this could be spun up on a login node
  • or it could (should?) be deployed on the cluster using ArgoCD, etc.
    • but then it will not be public
      • unless we deploy on Google and push the data to it
      • then whatever does the pushing should be deployed properly
  • TJ: The plots are just being read from the filesystem
    • There will be more development in the future
  • SK: Plans to move to Kubernetes, but this is just a Python process for now (see the sketch after this list)
  • KT: The page is serving the plots, so there isn't/shouldn't be a link to the plots themselves
  • WOM: If security locks down commissioning data even more, then these plots could be left public
  • KT: While NCSA services are behind the VPN, these are non-privileged containers and don't have to be
    • MB: These all access GPFS and so are indeed behind the VPN
  • FE: if you want authentication, we need to consider adding these to the science platform
    • Is this a one-off, or a template for many future services?
      • WOM: these things never remain a one-off, especially if they're public. 
    • MB: It's not just A&A, it depends on which nodes it operates on
  • WOM: this is currently at NCSA, but will also need to be at IDF, USDF, IN2P3
  • JB: need to carve out room for this to run on DP0 without using the remote Butler
  • KT: It can run as a service, does not need remote Butler for now. We can set up something now in containers as long as it is temporary.
  • WOM: Temporary containers are OK as a prototype
  • CTS: what is the path forward? 
    • FE: could possibly run this at NCSA without a VPN under some circumstances
    • FE: need to decide if this is a roadmap, and if it is, plan how that leads to the Science Platform
    • TJ: If it's in DP0, we need a proper web service
  • MB: short term version is just for the developers, so maybe it should remain behind the firewall
  • CTS: is the only way out of the firewall a new Kubernetes cluster?
    • WOM: that's a topic for an extended discussion
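
A minimal sketch of the kind of simple Python process described above (not the actual pipe_analysis site code): serving already rendered QA plots straight from a filesystem directory. The path and port are placeholders, and there is no authentication, which is exactly the A&A question raised in the discussion.

    # Sketch: serve pre-rendered QA plots from a directory using only the
    # standard library. Path and port are placeholders; no authentication.
    import functools
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    PLOT_DIR = "/path/to/qa_plots"   # e.g. a directory of rendered plots on GPFS
    PORT = 8080

    handler = functools.partial(SimpleHTTPRequestHandler, directory=PLOT_DIR)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", PORT), handler).serve_forever()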
9:40 | Handing over the Community platform to Operations
  • DM is not scoped to provide support in construction. Now that we have the CET funded in pre-ops and they are building a model for community engagement in operations, I'd like to hand the management and evolution of the community platform to the Ops CET.
  • DMLT_CommunityPlatform.pdf
  • KT: Is Jim Annis a moderator? MLG: Yes
  • FE: can we use "deliver to operations" rather than "handover to operations"?
    • DM/DP will continue supporting the service; emphasis that it is for in-project use as well as a community service.
    • CET has authority to design the front page, assign moderators, etc.; we need to maintain private groups/private topics as an internal communication tool for DM.
    • We could write this division of responsibilities down in a technote.
  • RHL: It is hard to transmit project knowledge to the CET. Support channels are said to be essential, but are not sustainable.
    • MLG: continued participation of DM expertise is essential
    • RHL: How does the DM side of support scale?
      • WOM: That is an operations issue, and not a DM issue
        • CET is meant to be the curator and first line of defense against science questions.
        • That means DM/DP does not have to monitor Community, but CET may call on specific experts to answer hard questions when they come up.
      • FE: We should discuss this when Leanne is present
  • FE: We will test this with DP0


Moderator: Kian-Tat Lim

Notetaker: Yusra AlSayyad
10:50 | Update on DMTN-139 | Gregory Dubois-Felsmann

At the 1-11-2021 meeting, while reviewing DMLT ticket DM-15198, Gregory offered to give an update.

TJ: Isn't the exposure table just a table with all the FITS headers, and not a derived quantity?
GPDF: I don't remember that being the case.
KT: Confirms. DPDD doesn't even have a schema for the exposure table.
GPDF: I'm more worried about the visit table, which has metrics.

TJ: Could we do a join of all the squash metrics and a link to the exposures?
GPDF: Yes, and can we map that onto CAOM2? We previously acknowledged that CAOM2 was awkward to use in production, but I think we can still use it afterward. This is lookback.
TJ: Ah, It's not the concept of visit, but the actual post-processed visit.

FE: Big, big fan of cutting down data models. There are way too many now, and it's hard to deal with them. I would love to go to the two that Gregory has suggested.
KT: As long as there are two and not more than that ... dynamic observing.

Slide 9: Metadata creation and loading workflows. DRPs.
KT: This also relates back to campaign management activities.
GPDF: If we can export Gen3 into the ObsCore data model, we can do that on the fly and get a respectable image browser. There's tooling that can use that (see the query sketch below).
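
A minimal sketch of what "tooling that can use that" looks like once image metadata is exposed through the IVOA ObsCore model over TAP; the service URL is a placeholder, and pyvo is just one example client.

    # Sketch: query an ObsCore table over TAP with pyvo. The endpoint URL is
    # hypothetical; the ivoa.ObsCore column names follow the IVOA standard.
    import pyvo

    TAP_URL = "https://example.org/api/tap"  # placeholder TAP endpoint

    service = pyvo.dal.TAPService(TAP_URL)
    results = service.search(
        "SELECT TOP 10 obs_id, dataproduct_type, s_ra, s_dec, t_min "
        "FROM ivoa.ObsCore WHERE dataproduct_type = 'image'"
    )
    print(results.to_table())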

Slide 10: Nightly Processing 
WO: It is looking like we will be asked to hold > 6hr.
CS: The metadata is useful for interpreting the alert stream itself. It'd be weird if the metadata for the alert weren't available until 24 hours later.
GPDF: Yes, we should be able to at least record that we HAVE TAKEN an observation. People would be able to deduce that from the alerts anyway.
KT: Some metadata is OK to release. The fact that we took a picture, what airmass or seeing is not a problem.
WO: People are ONLY worried about the pixels. Everything else is OK.
KT: So now we're back to <6h release of the metadata.
FE: It's frustrating, but we understand what they're worried about. We should get ahead of it and say: here's what we want to do without releasing the pixels. What makes me nervous is that anyone can call Wil and tell him they have a draconian solution to this problem that he has to implement.

GPDF: I know Frossie has opinions on this.
FE: Having a mode where we can serve static files (avoiding the computation required by the baseline image service design) would help.
KT: if everyone's looking at the same asteroid, we can cache it. There was also that statement that "If people are using the pixels and not the catalogs, then we failed."
Notetaker's editorial: boo
WO: That's basically an FTP server.
FE: noooooooo

GPDF (running out of time): Frossie, take a look at the TAP slide.

11:20 | APDB Update | Fritz Mueller
  • APDB Cassandra scale experiments are concluding; summary report on recommended design and hardware requirements.
  • Coordination discussion w/ AP team: when/where/how will at-scale Cassandra APDB be integrated into ongoing AP development efforts?
  • Slides

KT: sometimes, I found string reps of numbers compress better than binary reps. It ends up being more verbose but compressing better.
AS: In this case, it's not numbers that are the problem.
KT: Sounds not worth investigating.

YA: Reconciling this with what you said earlier about pandas being faster than afwTable: so converting to a pandas DataFrame is slow, but converting to afwTable is slower?
AS: Yep
EB: The AP pipeline uses pandas and I'm not comfortable committing to refactoring at this point. 
AS: You can save money by having a smaller cluster.
CS: It matters where you're doing it because it determines what you build into the AP cluster vs. the APDB cluster. Not fair to include the timing of the client when measuring the scaling of the DB server cluster.
FM: I just want to bring to your attention that this conversion takes time, and you should be aware that it's there.
[Notetaker's aside: There is lower hanging fruit on the client side]

IS: It would be nice to see scaling with a factor 10x higher source density for the performance of outliers.
AS: I'm worried about averages. The total number in the database is what's important.

FM: Cassandra has been holding up as a horizontal scaling strategy.
Cost, cloud vs. vendor: 1-2 years is the breakeven point.
The good thing is both are fine. The bad thing is it's a tough decision.
MB: Make sure you include the test systems too! Also, can you grow this, or do you need 12 right away?
FM: It'll only save us half a year at most. But I'm open to your advice.

AS: Code is on a separate branch that has diverged significantly.
EB: We took some action with Michelle and co. and are using Postgres at NCSA. It's working out. We're working on Gen3 migration issues and schema changes that we can put in. We have a way forward that's functional. It isn't obvious that we would hit the scale where we need Cassandra even during commissioning.
How and when should we test this? Maybe we can separate it from the on-sky timeline.

FM: It sounds like you're OK with your Postgres solution at NCSA. We'll evolve that with the Cassandra API over the next few months.
Do you think Postgres will get you through commissioning?
EB: it evolves with how long commissioning lasts, but we have to check with commissioning and the team.

Wil: How big is it?
25-30 TB per year? ~300 TB total.
KT: Archive it.
KT: Google has a way to snapshot your disks. Oh, not Google? OK, well, still archive it.






12:30

Moderator:

Notetaker: Frossie Economou
13:00 | Team status


* UW

- Was the bug detected with the fake source injection pipeline an aperture correction bug? Yes.

- How well are things working in the DECam HiTS and bulge fields? There is a technote for the bulge data. Differencing artifacts remain. Please ask Ian for more numbers.

* Princeton

- Please leave Jim alone on Focus Friday.

- What is the limiting factor for the three iterations of HSC-RC2? Waiting time is less for Gen3. Having more reruns would help "slightly".

- Is there any way to get Leanne's group to look at Gen3 outputs? The triage process requires a lot of experience from Lauren.

* NCSA

- Tucson test stand PDUs delayed till April (RHL in chat).

- No change in development; NCSA still working with Tony on the main camera (Wil, in response to slide).

* Architecture

- Build engineer ad is out, please share (Wil).

* DAX

- No questions.

* SQuaRE

- Where are we with the Science Platform landing page? squareone is under early development, to be released next half-cycle.

* Science

- Has Focus Friday really impacted Stack Club? [Resumption of earlier discussion about better coverage for Stack Club by going back to assigning people to mind it; Wil will discuss slide phrasing.]

* Wrap-up

- Thanks to all for a good meeting.

- Let's use Slack instead of Zoom chat for side discussions next time [but then we can get distracted -- KT].

- Action review.







14:10 | Wrap up

Next DMLTs:

  • 2021 June 8-10 - Clash with Penn State Stats
  • PCW not in person
  • 2021 October 26-28 - Clash with ADASS ... move?
  • 2022 February 15-17
14:30 | Close

...