Date

Teleconference coordinates: https://bluejeans.com/507894076

Attendees

Goals

Discussion items

Time | Item | Who | Notes | Conclusions and action items
60 min | Debrief from LSST 2017 | All | see below | see below

    • Not clear how AHM is organized (the overall goals, the theme). This remains an issue.
      • How do we strike a good balance of sessions we want (bottom-up) vs. sessions we need (top-down organization)?
      • Mario Juric: make sure to fill out the LSST2017 surveys!
    • Was at the DESC commissioning simulations session, found it useful.

  • Gregory Dubois-Felsmann
    • Dialing in from JupyterCon
    • Summarized the LSP session
      • Lots of talk of specifics of using the LSP, large result sets, reconstitution of data into its Python form from queries
      • The only fundamentally new thing: somebody asking about our ability to provide encryption of users' notebooks
      • We need to continue reaching out to the SCs and make sure the tools we're building meet the needs of the science they want to do. Gregory is interested in becoming more active in that outreach: talking to the community about how they plan to use our tools, steering them in the right direction, and updating our plans when needed. We all need to understand that better.
    • Spent most of his time on Butler/SuperTask; it felt more like a JTM than a meeting with scientists
    • Was in PSF estimation & deblending sessions, but didn't learn anything fundamentally new (because we already talk to these folks)
    • Level 3 batch questions:
      • It would be nice to have some clarity from the LSP design on what kind of compute and storage we’re going to offer to the users
        • What is the batch system going to look like?
        • What is the storage system going to look like?
      • Example:
        • If someone wants to rerun a SuperTask in a different way on a bunch of objects they've selected, is that something they can expect to do within the science platform by launching that in batch, or are they limited by the CPUs their notebook is running on and all computation will happen locally, in serial? Or is there a way they can launch jobs?
      • There was a lot of discussion on the gaps in our understanding of how the platform users will interact with the batch system. I.e., will people be asked to do the moral equivalent of `qsub`, or will we have a friendlier interface to the system?
        • Mario Juric thought developing this friendly layer was part of SQuaRE's remit (remembers discussing it in ~fall/winter 2016)?
        • Simon Krughoff reports it has not made it into SQuaRE's final plan.
      • Gregory Dubois-Felsmann: It would be good to make sure the workflow system we use internally is usable by our users as well; otherwise we won't be able to efficiently share capacity between L2 and L3 resources if we ever wanted to. Cautionary tales from BaBar. He worries that what we're seeing so far from NCSA is a very production-oriented design.
      • Jim Bosch: if we are planning to have users run our L2 workflow, those considerations have not yet been included in the workflow system design.
        • Even if we didn't want to offer the internal workflow system to the users, since our developers will use the LSP for development, they should have the same workflow system/interfaces available at least to them.
        • This will also generate additional requirements on the Butler/SuperTask and we should take it into account.
        • It's not clear how all these things fit into the system that Michelle is designing
      • Unknown User (mjuric-admin) mentioned there was an independent discussion of this with Don; they're currently planning to offer a fairly "vanilla" batch system to the users. We would "sprinkle" something on top of that (a Python API) to make it easier to submit jobs from notebooks, but the batch system itself will be fairly classical (HTCondor).
      • Robert Gruendl: the real worry is how long you allow user jobs to persist and how big they are.
        • Gregory Dubois-Felsmann: I think we worry about one level higher than that – what if the user writes a supertask and wants to run it (in batch) on some subset of objects/images/data. How do they do that? Do they write scripts of their own? Or do they re-use the workflow system we have internally?
        • Robert Gruendl thought this guidance/helpdesk for the users would come from SQuaRE.
      • Unknown User (mjuric-admin): Bottom line: we need to clarify the LSP <-> batch interface(s) and who's responsible for what.
        • Set up meetings to follow up on this
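To make the "friendly layer" discussed above concrete, here is a minimal sketch of what a notebook-side Python wrapper over a classical HTCondor-style batch system could look like. All names here (`make_submit_description`, `run_supertask.sh`, the `--tract` argument) are hypothetical illustrations, not an actual LSP or SQuaRE API; the point is only that users would call a simple function instead of hand-writing the moral equivalent of `qsub`:

```python
# Hypothetical sketch of a "friendly" notebook-side batch wrapper.
# None of these names are a real LSP API; they are invented for
# illustration. The idea: a thin Python layer "sprinkled" on top of a
# classical HTCondor-style batch system, so users never hand-write
# submit files.

def make_submit_description(executable, args=(), cpus=1, memory_mb=2048):
    """Build a classical HTCondor-style submit description as a dict."""
    return {
        "executable": executable,
        "arguments": " ".join(str(a) for a in args),
        "request_cpus": str(cpus),
        "request_memory": f"{memory_mb}MB",
        "queue": "1",
    }

def render_submit_file(desc):
    """Render the dict into submit-file text a plain batch system accepts."""
    lines = [f"{k} = {v}" for k, v in desc.items() if k != "queue"]
    lines.append(f"queue {desc['queue']}")
    return "\n".join(lines)

# Example: rerun a (hypothetical) SuperTask over a user-selected subset,
# launched in batch rather than serially on the notebook's own CPUs.
desc = make_submit_description("run_supertask.sh", args=["--tract", 9615], cpus=4)
print(render_submit_file(desc))
```

This is only a strawman for the open question above: whoever owns this layer would decide whether it wraps the internal workflow system or talks to the batch scheduler directly.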

  • Simon Krughoff
    • Calibration products, talking to Merlin et al.
    • How will we take them all? (There are many images to take.)
      • Simon reports he's been named the DM liaison to the Commissioning Scientist (by Chuck)
      • Worries about the SNR for various calibration products; if we don't know that, we don't know how many we have to take.
      • Anecdotal potential issues with bias frames
        • Biases are not as stable in the test stand as one would hope (the electronics drift)
          • Biases may have to be taken during the night.
        • We were surprised by this. Simon says this is anecdotal, based on the test stand.
    • Commissioning rehearsals were useful and fun
      • How will we do releases in commissioning?
        • Simon Krughoff expects we'll make releases potentially even twice per day early in commissioning
      • How do we (DM) support operations (third shift)?
      • Michael Wood-Vasey concurs these were interesting, JTM will be even better.
    • Remote access was not good; frustration; bad wifi/connection, bad audio
      • Either make sure it works, or don't do it at all
    • AuxTel status has him worried
      • Lots of TBDs even for hardware on how it’s going to come together
        • Designs not solid even for things that are due to be built soon
      • Worried we may make wrong decisions (or repeat things to get it right) because we have to do them and don't have time to do analysis
    • There was some discussion on remote access
      • Robert Gruendl: if you'll provide remote access, make sure it works. Or focus on providing good remote access only some of the time (i.e., not for all sessions). That way people can attend only a portion of the meeting, but usefully. As long as we're using hotels for this, he's worried we won't be able to do remote access well.
      • Gregory Dubois-Felsmann: it's not realistic to expect high-quality connection to many sessions with the effort spent & the venue chosen; depends on too many variables.
      • Mario Juric: I see this work well elsewhere; confused how we experience the same problems (at least since Bremerton). Don't think it's a fundamental issue with these kinds of meetings, more with our organization. Strange that we wouldn't want to pay for a wired connection in each room, or have speaker laptops/speakerphones/webcams ready. It's peanuts relative to what it costs to organize this meeting.
      • Simon Krughoff: Thought Victor made it clear in a plenary that he doesn't want to make it easy for people to attend remotely, in order to increase in-person participation. Mario Juric got the opposite feedback (incl. a promise after last Feb's JTM that the project was buying "meeting in a box" telecon equipment to make things just work). He has never seen that box in action. It's frustrating and causes trust issues. It feels like we're being told whatever will make us believe the problem will be fixed and go away.
      • Michael Wood-Vasey: feels that the project has an implicit, if not explicit, policy on this that we disagree with. For a small investment of effort they could do noticeably better. At least stream the plenaries. Parallel sessions are harder; you can't do them because they take more technical support.

  • Melissa Graham
    • SAC meeting was interesting
    • Went to the science sessions
      • Contributed to information on data processing
      • Unknown User (mjuric-admin): Note that 10% for SPs is not the same thing as 10% for Level 3; the "10%" is the same number just by chance. The 10% for SPs are allocated within the normal "Level 2" budget (or should be).
      • Melissa Graham: Need to confirm that with KTL et al.

  • Colin Slater
    • Most useful, beyond the sessions, was lunch we had with Fritz Mueller and Donald Petravick on various map-reduce type options for data storage & processing, and how those relate to user experience.
    • We don't have a very good story on how we'll do next-to-the-database processing; we're working to understand this better.
      • Still in early stages
    • Trying to better understand our database requirements from the PoV of the user
      • Current requirements are too simplistic. Trying to give more info to Fritz on what the real requirements are.
  • Zeljko Ivezic
    • Muted by an oboe player

  • Mario Juric
    • Note: I had prepared the notes in the italicized text below, but didn't get a chance to discuss them during the meeting; including them here in case anyone wants to comment.
    • SAC Meeting
      • Interesting preliminary communication survey results
        • Most people use mailing lists, conferences
        • Relatively few people use community
        • Desire for a better website
        • This all may be biased, had a small sample at the time
          • Ask Beth for full results
          • Ask Beth for her presentation
        • Jim Bosch points out that the survey questions may not be distinguishing between people who've never tried community, vs. people who've tried to use it and hated it.
      • I presented how we’re organizing the communication with the community at SST level
        • Points of contact concept
          • We haven't enacted this fully because of the replan; need to do it now
          • Have to become embedded in the Science Collaborations, to understand the community well
          • This will also allow us to stay more active scientifically
            • ACTION: Name the PoCs
        • “Two communities”
          • Software stack — already supporting
          • Broader community: need to support better (PoCs, conferences, etc)
            • Should have a website, collection of information, targeting the latter community
              • IPAC WFIRST website is an example
              • Thinking about raising this to the level of the PST
              • Gregory mentioned IPAC experience
        • Have been positively biased towards the stack community
          • We all need to be careful when we address the community; it’s not appropriate for everyone to run the stack
            • The stack is primarily an internal product; most of our users will never use it. Beyond internal LSST use, its goals are (LDM-294):
              • Primary: Reproducibility and documentation
              • Secondary: Enable systematics-limited science that requires partial reprocessing (DESC)
              • Tertiary: Enable new surveys to re-use LSST codes
            • I’m worried we’re confusing most of the community by placing a strong emphasis on the stack
              • IMHO, need to develop demos where the LSST stack is just one component (or not used at all), that are more oriented on how LSST-scale datasets and the LSP will be used
    • Had a meeting with the DESC, they’re worried about the level of support we’re offering them
      • Communicated we’re already supporting DESC a lot more than other collaborations.
      • IMHO, they need to acknowledge they’re understaffed for what they're trying to do (orders-of-magnitude larger productions, with code that's still prototype and poorly documented, that they don't understand, and with very few resources, on the order of ~2 FTE, if I understood correctly)
      • Advised they should look for more resources, should they want to run our code at this stage. The best option may be for them to embed someone into DM pipelines
        • Action on me to talk to Phil Marshall to review their SRM and sync it up with the DM pipeline; pull in other SST people
          • Simon Krughoff: they realize it's hard; they're thinking about mocking pieces of DM that don't exist. But that's also really hard. He's worried about it.
          • Mario Juric would advise that they try to lag the DM schedule, rather than be ahead of it. That way the features they use have already been tested by DM.
          • Michael Wood-Vasey wonders how much time are we really spending on DESC? Are these things we'd want to fix anyway?
          • Jim Bosch worries DESC is trying to get involved too early, relying on milestones that even we don't believe in and consider flexible. The history of previous surveys has been that you don't do what DESC is trying to do before ops. We should manage expectations. This may be a programmatic issue with them.
    • Commissioning planning
      • Chuck’s developing a commissioning plan this Aug through December. We need to participate to help him fill in the blanks.
      • I think Michael W-V would be a good point person here, with input from Robert Gruendl, Eric, Colin and the rest of us
      • Sent an e-mail to Chuck about this
    • Meeting on RFCs
      • Didn’t have Gregory Dubois-Felsmann present, so all actions are tentative
        • To send e-mail to Gregory and everyone with recommended actions
      • LCR-908: Keep all Data Releases loaded in databases
        • Enact a requirement to build a system capable of retaining all DRs
      • Store Processed Visit Images (PVIs) and Make PVIs Available to Users (RFC-325)
        • Reformulate the problem as latency requirement (e.g., on the UI)
          • 1-2 seconds to display JPEG-y “movie strip” of the object
            • Need to understand the number of users
        • Retain all raw images on disk
          • Est. ~$600k in Construction, $100k/yr in ops, if done
          • Enables reconstruction of reduced images in ~20-30s
            • Main use case is likely to verify the compressed calexps are fine (as they’re otherwise expected to be usable for science)
        • Establish a lossy compression WG, aim to retain lossy compressed calexps on disk
          • These would still be science usable (e.g., both DES and ZTF compress their data this way)
          • Est. <$2.5M in Construction (likely closer to ~$800k), and <1.2M/yr in ops (likely ~$400k)
    • Solar System Data Products session:
      • Started the discussion with the SSSC on updating the SSObject schema
      • They want to understand what the project will offer, vs. what they need to organize to build themselves
      • Gathered proposed input, will be integrating it into a proposed SSObject schema update
        • Data rights issues brought up, need to follow-up with Beth
    • Commissioning rehearsal:
      • Looked good; made us think how the commissioning will unfold
      • One thing that was curious to me was the assumption that the DM folks will gather in Tucson
        • Would it make more sense to gather at NCSA?
          • Arguments: the Data Center and the “Mountain” are the two teams the DM developers will need the tightest loop with while we’re learning how to run the system.
            • “Mountain” is in Chile, too difficult to send everyone there, no room
              • A few people should be there
            • My guess is that most issues will be about getting the developer codes to run in production conditions at NCSA
              • A tight loop between Ops and Dev people should be extremely valuable here
              • This argues for gathering at NCSA
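The lossy-compression proposal in the RFC notes above (retaining lossy-compressed calexps on disk, as DES and ZTF do) rests on quantization: rounding float pixels onto a coarse integer grid so the result compresses well while the error stays bounded. The sketch below is a generic illustration of that principle in plain Python, not the actual scheme a WG would adopt; all numbers are made up for the example:

```python
# Generic sketch of quantization, the lossy step behind fpack-style FITS
# compression (entropy coding on quantized integers). This is NOT the
# actual LSST/WG scheme; it only illustrates the error-vs-size trade-off.

def quantize(pixels, step):
    """Round each float pixel to the nearest multiple of `step`."""
    return [round(p / step) for p in pixels]

def dequantize(codes, step):
    """Reconstruct approximate pixel values from the integer codes."""
    return [c * step for c in codes]

pixels = [100.37, 100.51, 250.02, 13.9]   # made-up pixel values
step = 0.5                                # coarser step: smaller file, larger error
codes = quantize(pixels, step)            # small integers compress far better
restored = dequantize(codes, step)
max_err = max(abs(a - b) for a, b in zip(pixels, restored))
assert max_err <= step / 2                # quantization error is bounded by step/2
```

The bounded per-pixel error is what makes such products still usable for science, which is the premise behind the cost estimates quoted above.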
  • Gregory Dubois-Felsmann will convene meeting(s) to understand the state of and clarify the LSP ↔ batch interface (maybe a WG?). There's a concern a) that we're designing the workflow system in a way that may make it difficult for users to reuse, and b) that it's not clear who (if anyone) will write the user-friendly LSP ↔ batch interaction layer. (See the Notes for details.)
    (Note 2019-03-06, This task was re-tasked in 2018-05-21 DM SST F2F meeting to create a ticket for this work)

  • Leanne Guy to create a ticket for Robert Lupton to follow up with Steve Ritz or Chris Stubbs on potential issues with biases floating around (see Simon's notes). Also follow up on issues with the lens that were mentioned offhand in the plenary (there are no CCB records of it). 2019-03-07: Following discussion with Robert Lupton, the lens issue was not followed up; Robert does not remember the exact issue
  • Mario Juric should report back to the organizing committee the difficulties with remote access (see Notes for the details)

  • Melissa Graham to follow up with Kian-Tat Lim to verify that the sizing model includes the allocation of compute and storage capacity for special programs.

  • Mario Juric to ask the communications team for an updated community survey.

  • Mario Juric will start the process to name the PoCs for all SCs (done: see here)

  • Mario Juric to send e-mail to Gregory and everyone with recommended actions on LCRs

  • Mario Juric to discuss with Beth Willman data access peculiarities for Science Collaboration participants w/o DAC rights (the UK being the primary example): if a SC decides to build common data products in the US DAC, their non-DAC-rights members won't be able to access them. To confirm that this is the policy.

  • Everyone: please fill out LSST2017 exit survey

  • Everyone: please upload your presentations to the LSST2017 website

  • Mario Juric to schedule a Doodle poll for a new meeting time during this semester.