CTS: Why are you waiting for the dome to move? We have an antenna on top of the dome.
KTL: Did I hear Mac minis are in hand? CS: Yep, they're rack-ready.
RD: I know Yee is interested in ZFS; would there be any issue with using it at the summit? KTL: There's stuff in the diagnostic cluster: local file systems. The OODS: anticipating moving that to S3, so that wouldn't use ZFS. CS: For encryption, we're going to use ? and ? for the local file system. For Ceph, we're going to use a different approach.
FE: How are you doing staffing-wise? CS: We're hiring a new devops engineer, remote. New ad next week. And some positions opening for nighttime support. We broke a record for sick-leave season.
Data Abstraction
FE: are we losing ci.lsst.codes? General SLAC domain name issue? KTL: This is a separate thing under Richard's section
JFB: Is Jeremy someone who can take ownership of the butler schema column naming convention? GPDF: Yes.
FE: Regarding your questions about your interfaces with the adjoining divisions, let's talk about it between the Status and the What Remains To Be Done in 2023 sessions.
Data Production
GPDF: Is the Alert Archive (as described in DMTN-183) regularly deployed? Bellm: Yes, it's part of the Alert Distribution mechanism.
GPDF: When does AuxTel shift to 5 night/week operation? Slater: Not yet known.
FM: cm-service codebase is going to be integrated properly into DM practices, code conventions, and CI.
KTL: A database of daily calibrations was mentioned. Is this a monitoring database or something for DRP? Chris Waters: it's a monitoring thing that goes into Sasquatch. Nothing to do with DRP. Trend analysis to determine when calibrations are going out of spec.
Data Facility
TJ: If user data is in the cloud, how are users getting accounts on SLAC for batch going to be able to access their user data? FE: We haven't yet proven hybrid performance-wise or component-wise, and the story around the prompt products is loose. The answer is non-trivial.
FM: We're worried about Cassandra, but we have a plan. Andy S's new worry is whether Postgres can perform for hosting the PPDB.
CTS: Your concern about delay in getting SRCF2? RD: Most everything goes in there. It's two components of the same center. Negotiations on which racks go where, SLAC vs. Stanford. There are 4,000 Milan cores / 20 PB of storage waiting to be installed in SRCF2.
CS: The KTL job? KTL: There are some questions about the divisions of the duchies here, but most of it sits in Tim's Duchy rather than Richard's. FE: KT wears many hats and we're not going to get like-for-like. We need to know who is looking after each stage of the life cycle. YA: We need an army of understudies. KTL: My goal over the next year is to hand over everything possible.
DM-SST
TJ: I'm worried about user storage for butler. JFB: This came up yesterday. We need to nail down where butler user outputs are getting stored and whether there's an option for user sqlite dbs. FE: A lot of this work will come back around to Data Abstraction I'm afraid.
Data Services
LG: We need to be prepared for DP1 to be ComCam or early LSSTCam. We should have a better idea in Feb. We should expect an uptick in user interest. You're right to prep for a much larger user base. FE: The press release is probably going to be the biggest spike.
TJ: I assume people want butler.put() to work in the notebook? No tutorials do butler.put(). May I focus on performance of butler.get(), or should I prioritize butler.put()? What are the priority tradeoffs? WO: Prioritize butler.get() for the DPs and butler.put() for the DRs. CTS: Reframe to whether people can do a pipetask run. People WILL want to do that for DP1. TJ: We're imagining that pipetask run will put into a local per-user butler registry.
KTL: The DRP data flows to RSP. Are we really that confident? FE: We've done it for DP0.2, not for hybrid. GPDF: I too am worried about the PP data flow. When we were doing DP0.1, DP0.2, and DP0.3 there were so many little things we forgot; we need to do the same for PP. The time it took for the data model to be curated doesn't scale to operations. Have a plan for Jeremy to improve the process. How big are DP1 and DP2 going to be? Can we do the ingest at scale? I don't know; I'd ask Fritz to comment. We haven't done Cassandra to Postgres yet. Qserv is only going to be used for data previews through the end of the year. In the ops era, Qserv wouldn't be used at all until DR1 comes out. Although Postgres can handle DP1 and DP2, we've committed to using Qserv for those so that we have experience. The full-scale load is going to be on Postgres. FM: There's a lack of clarity about the interface between the APDB and PPDB. KTL: If you're confident in DRP, then you can be confident in PP. But if you're not confident in DRP, then you should be very worried.
TJ: Ok when do we talk about the interfaces/boundaries of Data Abstraction? FE: I propose we come back and do the end to end life of a photon and then if there are any questions we can talk after that.
GPDF: Renaming the Alert Database to Alert Archive: an S3 bucket indexed by the PPDB. If you reissue a diaSource with a new ID, what happens?
KTL: Need multiple tables.
EB: Not a lot of requirements on the archive, though a lot of features we might want.
GPDF: True, but there is the embarrassment threshold.
KTL: We can have a library, and a user can run a service in a notebook.
KTL: Is PPDB replication under or over 24 hours?
GPDF: The original requirement was no more than 24 hours. Could be done in batches throughout the day.
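The batched-replication idea above can be sketched as a watermark copy: each batch moves only the APDB rows added since the last run, so several batches per day easily stay within the 24-hour requirement. This is purely illustrative; the table and column names below are invented, not the actual APDB/PPDB schema, and sqlite stands in for Cassandra/Postgres.

```python
import sqlite3

def replicate_batch(apdb, ppdb, watermark):
    """Copy rows newer than `watermark` from the APDB to the PPDB.

    Returns the new high-water mark (illustrative schema, not the real one).
    """
    rows = apdb.execute(
        "SELECT id, ra, dec, midpointtai FROM dia_source WHERE id > ? ORDER BY id",
        (watermark,),
    ).fetchall()
    ppdb.executemany(
        "INSERT INTO dia_source (id, ra, dec, midpointtai) VALUES (?, ?, ?, ?)",
        rows,
    )
    ppdb.commit()
    return rows[-1][0] if rows else watermark

apdb = sqlite3.connect(":memory:")
ppdb = sqlite3.connect(":memory:")
for db in (apdb, ppdb):
    db.execute(
        "CREATE TABLE dia_source (id INTEGER PRIMARY KEY, ra REAL, dec REAL, midpointtai REAL)"
    )

# Five sources land in the "APDB" during the night...
apdb.executemany(
    "INSERT INTO dia_source VALUES (?, ?, ?, ?)",
    [(i, 10.0 + i, -30.0, 60000.1 + i) for i in range(1, 6)],
)

# ...and one daytime batch carries them all to the "PPDB".
wm = replicate_batch(apdb, ppdb, watermark=0)
print(wm, ppdb.execute("SELECT COUNT(*) FROM dia_source").fetchone()[0])  # 5 5
```

Running replication several times a day just repeats `replicate_batch` with the previous high-water mark; rows already copied are never re-read.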
KTL: Are sources associated with embargoed images allowed to be published?
EB: From DMTN-199, yes; there is no difference for non-pixel data products.
FE: What exists to test the APDB interface?
KTL/EB: This exists, and is run during Prompt Processing, which reads and writes to Cassandra (currently Postgres). The ideal is the APDB replicated to the PPDB in real time.
FE: Concern is about staff needing to run tests during the night, if the PPDB is not updated yet.
KTL: It is possible for staff to query APDB directly.
FM: Andy S. will be working on the PPDB replication first thing.
EB: We are trying to set up workflows that do not require direct access to the APDB. We are writing analysis metrics and plots.
GPDF: Who decides PPDB replication cadence? Is it up to Andy S or are we setting requirements?
EB: We are letting Andy S drive. Staff can always access the catalogs directly through the butler, though that is not a database.
FE: InfluxDB was meant to be internal, but users want it all the time. Concerned about a DB that is not accessible. What is Plan A for exposing services to the data in the APDB?
FM: The interface to the APDB is the shim provided to AP, and that is the ONLY interface that will be supported since it is designed for AP performance.
WOM: Note that the Gaia catalog is 1.5B rows, in Postgres. APDB queries during the day will not conflict with observing; it's just night-time queries that are problematic for performance.
KTL: Solar System Processing must run on the PPDB, not APDB.
EB: If that is a requirement, then PPDB replication must happen in real time since SSP needs the new sources.
KTL: Now that long trailed sources are filtered, it might be possible to run SSP from APDB.
GPDF: We will provide spatially sharded catalogs from DRP. For AP, we have not promised to provide this as a rolling updated spatially-sharded dataset.
CS: Catch is that direct image sources are needed to do QA on the images.
GPDF: Difference is we are computing source catalogs from AP for our own uses, means we don't have the same requirements as for data provided to the public. Does not have to be in crowded fields or reach our limiting magnitude, for example.
EB: Catalogs can be accessed through the staff RSP on repo/embargo. Possible to run afterburner and provide diaSources, but not diaObjects since those will be changing all the time and can't be represented in the butler.
GPDF: The SST meeting said we would make diaSources available through the Butler.
GPDF: Where is the butler registry for the embargo rack?
KTL: All of the registries are in the same place. The internal butler server lives outside the embargo rack.
TJ: What about the hybrid cloud?
KTL: No one should be able to 'put' to the release repo, they would need to 'put' to their own repo
TJ: Does that mean the nightly repo needs to be read/write?
KTL: The expectation is that users will need to run their own processing and write back to the repo.
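The pattern in the exchange above ("no one puts to the release repo; users put to their own repo, reading from the release repo") can be sketched as a layered lookup. This is a toy illustration, not the Butler API: plain dicts stand in for the datastores, and the dataset names are invented.

```python
# Illustrative sketch: gets consult the user's writable repo first, then
# fall back to the read-only release repo; puts only touch the user repo.
class LayeredRepo:
    def __init__(self, release):
        self.release = release  # read-only mapping: dataset ref -> value
        self.user = {}          # per-user writable store

    def put(self, ref, value):
        # Never writes to the release repo.
        self.user[ref] = value

    def get(self, ref):
        # User outputs shadow nothing here; release data stays untouched.
        if ref in self.user:
            return self.user[ref]
        return self.release[ref]

repo = LayeredRepo(release={"calexp/123": "release-pixels"})
repo.put("calexp/123-reprocessed", "user-pixels")
print(repo.get("calexp/123"), repo.get("calexp/123-reprocessed"))
```

The design point this captures is that the release repo needs no write access at all; everything user-generated lives in a separate, per-user store.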
Where does the data rights registry live?
TJ: 10,000 people will show up on day one to look at the PVIs.
RD: George B has asked about replicating the PPDB.
WOM: Added to missing functionality, but not clear we will provide this.
TJ: Where does the server live for Butler-client server?
KTL: It should live close to the DB.
TJ: When a user does a butler.put(), the data ends up in the cloud, but it must be reflected in the registry.
FE: Note that all user access must be private to that user, since we can't tell what data products are scientific proprietary.
How do images produced by Prompt Processing get to the users? Ideally we would ship image data to Google at first for safety, but stop doing that later since we can't afford it.
KTL: Fine with the butler registry server living on cloud, then when the images are published users can access them.
GPDF: Live ObsCore only works on a modifiable Postgres server that we can configure to support spatial shards (pgsphere).
TJ: In theory it's possible to do spatial indexing without pgsphere, but it's hard to implement.
FE: We really want an outside contributor to provide this, since we can't justify it in Construction.
WOM: Great project for LINCC; bring it up with Andy Connolly.
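One way to read TJ's point about spatial indexing without pgsphere: assign each source a coarse integer pixel from a plain ra/dec grid, index that column in an ordinary database, and prefilter cone searches by pixel before an exact distance cut. The sketch below is a deliberately crude illustration (real implementations use HTM or HEALPix trixels); all names and bin sizes are invented.

```python
import math

def pixel(ra, dec):
    """Coarse 1-degree grid cell for (ra, dec) in degrees (illustrative)."""
    return int(ra) % 360 * 180 + int(dec + 90.0)

def cone_prefilter(ra, dec, radius):
    """Set of pixels that could contain points within `radius` degrees.

    Over-covers on purpose: the database query filters on these pixels,
    then an exact angular-distance cut is applied to the survivors.
    """
    pix = set()
    r = int(math.ceil(radius)) + 1
    for dra in range(-r, r + 1):
        for ddec in range(-r, r + 1):
            d = min(max(dec + ddec, -90.0), 89.0)  # clamp at the poles
            pix.add(pixel((ra + dra) % 360.0, d))
    return pix

# A source 0.5 deg from the query center falls inside the prefilter set:
assert pixel(10.4, -30.2) in cone_prefilter(10.0, -30.0, 1.0)
```

The hard parts TJ alludes to (pole handling, ra wrap-around, choosing a shard size that balances index selectivity against fan-out) are exactly what pgsphere or HTM libraries solve properly.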
TJ: Are we proposing a dynamic cache, where the first time you do a butler.get it comes from USDF, second time it comes from the cache?
KTL: Yes
FE: Concern that user access patterns may defeat the cache.
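The dynamic cache TJ describes is a classic read-through cache: the first butler.get-style fetch goes to the origin (USDF), later fetches for the same dataset are served locally. A minimal sketch, with invented names and a dict standing in for real storage:

```python
class ReadThroughCache:
    """Illustrative read-through cache: miss -> fetch from origin, hit -> local."""

    def __init__(self, origin_fetch, capacity=1000):
        self.origin_fetch = origin_fetch  # slow path, e.g. a USDF fetch
        self.capacity = capacity
        self.cache = {}
        self.origin_hits = 0              # how often we went to the origin

    def get(self, ref):
        if ref not in self.cache:
            self.origin_hits += 1
            value = self.origin_fetch(ref)
            if len(self.cache) >= self.capacity:
                # Evict the oldest entry (dicts preserve insertion order).
                self.cache.pop(next(iter(self.cache)))
            self.cache[ref] = value
        return self.cache[ref]

cache = ReadThroughCache(origin_fetch=lambda ref: f"pixels-for-{ref}")
cache.get("calexp/123")
cache.get("calexp/123")
print(cache.origin_hits)  # 1 -- the second get was served from the cache
```

FE's concern maps directly onto the eviction line: if users mostly do one-pass scans over many distinct datasets, every get is a miss and the cache buys nothing.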
FE: Note that Live ObsCore was designed to be an internal service; it is now being used externally, and I'm concerned about robustness.
Major question of who owns each service involved e.g. Felis
KTL: Prompt Processing must not access the EFD, all metadata must come from the image headers. Could be possible to refer to EFD for DRP.
JB: Could use enhanced raws, where additional or updated metadata is attached to the raws for DRP.
JB: Would be possible to write metadata such as WCS back to the registry as metrics, which would be queryable by science users.
Main question: What is the technical implementation and ownership?
Need to produce summaries, not just have generic Sasquatch, and be able to archive reports
"Interactive" means that you can click on links to drill down on particular items (like RubinTV); does not need to have internal animation etc.
Example: plot of image quality through the night
All data can be generated from the EFD or Sasquatch metrics or outputs from "10 AM processing", although the last is probably not necessary and might not be available in time
Who is the target audience and the product owner? Survey performance might want different things from observers
There are indeed multiple customers, some of whom have their own tools already
Wil: Chronograf dashboard(s) could be sufficient, but we need to figure out what goes on the dashboard
Could have links to notebooks for more detail
LCR by ChuckC for building out nightly reporting, presumably beyond DM; LSE-490 gives requirements
Frossie Economou Write a DMTN describing what current tooling can produce in a nightly report
MerlinFL has been working on a nightly report.
Yusra AlSayyad will ask MerlinFL to describe what his nightly report will contain.
There are requirements for AP, calibration, Prompt Processing; these are all potentially satisfiable via metrics from AP pipelines.
Detection efficiency for point sources may be a bit more complicated. It's necessary even beyond the nightly report, but how to do it at scale for everything is still up in the air. May be able to do it in Prompt Processing as an afterburner.
A bit risky to load down Prompt Processing with unnecessary tasks
Need to have Consolidated Visit Database in order to access easily, but still need to make sure data is produced.
The Transformed EFD is definitely needed.
Summarizations at end of night have been described as "10 AM"; can likely be a single pipetask command
Don't want to rerun ISR et al., should only use previous metrics/measurements as inputs
Could possibly accumulate data during the night rather than summarizing only at the end
Accumulation over visits is possible except for aggregating over the focal plane
But accumulating things is not great in the Middleware
We can "accumulate" by rerunning a full aggregation over and over
Kubernetes has CronJobs that will be used for Transformed EFD and Embargo transfer
May need to make it easier for Pipelines to use this; SQuaRE has some mechanisms to turn YAML into jobs
Even without that, can make it work with GitOps (update pipeline definition, automatically goes into cron)
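For reference, a Kubernetes CronJob of the kind mentioned for the Transformed EFD and embargo transfer could look roughly like this. This is a hypothetical sketch: the image, schedule, and command are placeholders, not the actual deployment.

```yaml
# Hypothetical CronJob sketch; names and values are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: transformed-efd
spec:
  schedule: "0 10 * * *"        # "10 AM" daily; the real trigger may differ
  concurrencyPolicy: Forbid     # don't let a slow run overlap the next one
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: summarize
              image: example.org/transformed-efd:latest  # placeholder image
              command: ["python", "-m", "summarize_night"]
```

With GitOps as noted above, updating the pipeline definition in the repository would roll a new image or command into this manifest automatically.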
Early required reports were static; then ChuckC's LCR-1202 added interactivity, which should be supported by dashboards.
Need a breakdown of what is theoretically calculable during the night versus what needs true end-of-night processing? JimB: Not at this time, as there will likely always be something at the end, and anything at the end will be fast enough.
Commissioning likely wants more in "10 AM" than steady-state operations
Solar System processing also needs to be launched in daytime (takes 4+ hours)
"10 AM" should really be run as soon as last observation is taken, but we don't get a notification of that
09:45
Friendly developers and external contributions
Present the issue and schedule another focused meeting if needed.
AP is building infrastructure for Prompt Processing but is not supporting it in Operations
SteveP is the only possible person in Data Abstraction but doesn't seem appropriate
OODS, OCPS, Hermes-K, ingestd should not be owned by a single person
Campaign Management could be related?
Pilots are different from the framework
Services need to be supported beyond what Middleware used to do
Wil: This falls into Data Abstraction; should be SteveP and a new hire
SLAC new hire may not be this; should be someone else as well
Need to have officially designated team for execution frameworks and services
Haven't disentangled things that moved from NCSA
Wil O'Mullane Need to do a "delta-scrub" to ask for additional staffing
Is the Data Facilities boundary now "cloud provider equivalent"?
Anything that "disguises" multi-site, like Rucio (with Butler integration), should also be USDF.
When something breaks, could be service or infrastructure (reliability is not at cloud provider levels), need to investigate, will need cooperation on both sides
Don't want a public AP-like data preview (too much user support), but do need something staff-facing
Sample alerts based on DC2 data will be sent from USDF, but no prompt products in RSP — technology/integration demo for brokers, not intended to be scientifically meaningful
Want to define "Prompt Products Release Ops" based on LATISS as an ongoing internal-only, USDF RSP publication
Biggest technical hurdles are replication to PPDB (and prompt products Butler) and the image metadata model/flow (database server needs to be the same for all products and joinable image metadata)
Shims like having AP write directly to PPDB Postgres rather than Cassandra+replication are acceptable in the short term; can also use Base Test Stand to simulate images
Gregory Dubois-Felsmann Complete DMTN-105 defining the goal for "Prompt Products Release Ops"
We currently do a lot of processing with release candidates; this makes a mess of provenance
The last code used for the last step of a DR is always usable to regenerate the entire release (ensured by process, not technically).
DRs are the only thing we make intentional major releases for; others are for clock/calendar or for verification
Can't have downstream applications rely on main; should rely on stable weeklies or official releases
Some insist that they must have the latest version from yesterday
If we make it easier to patch release versions with things that someone needs, then that might help
Better CI mechanisms might help too: don't limit lsst_distrib developers from merging, but test the application before updating to a new daily/weekly release
Might be easier if low-level packages are removed from lsst_distrib and if "applications" like DRP, AP, and CPP are also removed from it.
RSP uses fixed tagged versions of lsst_distrib
GPDF: Worries about backward compatibility
Tensions between version used to generate data products and version in nublado due to failures of forward/backward compatibility in reading datasets
There's only one version of various RSP services like the Portal; they are not guaranteed to work against all pipeline outputs, as even if they use the stack they only use that one version
Will need to deal with this as DRs introduce breaking changes
In a future DMLT F2F, discuss conda/packaging
12:10
Wrap up: next meetings / meeting-free weeks / EPO liaison
Wil O'Mullane Put together a JTM/DM all-hands at SLAC the week of Feb 12
Wil O'Mullane Poll for a one-day DMLT F2F in January to prep for JTM/All-Hands
Wil O'Mullane Poll for a Summer meeting, maybe week of July 15 or later in August after PCW
Maybe Oct 21 for the last one
Meeting-free weeks in 2024:
April 8-12 - clashes with SST off-week, propose following week of 15 April
Jun 17-21 - clashes with SST off-week, propose following or preceding week
Sept 16-20 - OK
December 24 - 2025 Jan 3 - OK, no SST week over 2 week solstice period
EPO liaison: Blake wants someone to talk to; the issues are often architectural ("why do you want to do X?"), which might end up at GPDF or Frossie. See if someone else is interested in EPO? Or maybe dedicate space in the RSP Team meeting?
Frossie Economou Think about appropriate EPO liaison or mechanism
Wil O'Mullane Go through Google Doc of missing items and pull out action items
We have an EPO interface which, to some extent, Wil O'Mullane has been looking after. EPO have asked who they could interface with more regularly to make sure we are in line. Perhaps we invite Blake for a discussion and see if we have someone interested in being a more technical liaison to EPO.
Our theoretical release process has become detached from actual practice and our needs:
Many releases we make don't end up getting used for anything other than satisfying milestones and ticking down the deprecation-removal clock (or if they do, at the first sign of troubles we send users back to weekly, because we can actually support those better).
Our releases that are most heavily used (and most heavily backported-to) are those that are used for major production runs, but the time it takes to go from a release candidate to a release means that we end up actually using release candidates for the production instead of the release itself.
We tend to declare a weekly as the basis for a release only a few weeks afterwards, because it's highly desirable to start with a release for which regular processing has already succeeded. But this means we don't have much warning about when a release-basis weekly is about to happen, and hence can't actually act on blocking tickets.
Some ideas I have include:
Making the weeklies the units for the deprecation-removal clock.
Defining a release process for production-run releases that is initially less exhaustive and lower-overhead, but ultimately much more exhaustive, reflecting what we actually do.
DMS-REQ-0096, 0097, 0099, 0101 and 0394 require us to auto-generate various nightly quality reports. I'd like to discuss the tools we might use to generate those reports and understand who is responsible for delivering which pieces.
We're getting a lot of code contributed for our weak-lensing shear pipelines from DESC-affiliated developers who prefer to largely maintain ownership of their code (which often depends on DM code and otherwise looks a lot like typical DM code). I'd like to discuss both our policies for dealing with that kind of code (when to fork, when to use conda, when to draw a line and insist on taking full ownership), and how to ensure friendly developers we don't pay can contribute to (e.g.) our data release testing and coding in an era where data rights are in play, but commissioning is over.
A lot of questions came up at LSST@Europe5 about if/when we would be providing GPUs on the RSP. The use of ML code is growing in analysis of LSST data. One of the pros of "users in the cloud" that we have stated is the ability for users to bring resources to the data beyond what the project will provide. Sections of the community want to start doing this. If we are not planning to provide GPUs or enable users to bring GPUs to the data on the US DAC, users are asking how they can get the data out of the RSP so they can take the data to GPUs (possibly IDACs). Leanne Guy: could this be part of the "what's missing" front end?
Kian-Tat Lim Convene a meeting with Colin, Tim, Robert, Yusra to resolve graph generation with per-dataset quantities (likely based on Consolidated DB work).