
https://bluejeans.com/426716450

Participants

Agenda & Notes


Item | Who | Pre-meeting comments | Notes & action items
1. Past SDM standardization efforts (Yusra AlSayyad)
YA: PIPE-24 appears done from the Science Pipelines perspective.
A couple of years ago, Fritz found that the schema had changed, and asked that it be kept consistent with the DPDD.
The schema specification and the Science Pipelines output standardization have been moving closer together.
CS: Is it worth making sure that all of the things that have been done to date have wrapped up, and there are no more details to finish?
GD: What is the status of the standardization tooling with respect to the Gen 2 → Gen 3 migration?
YA: Presented outstanding DRP tickets: DM-22277, DM-28394, DM-24638.
GD: Will this work on both HSC and DC2, and are those tickets prerequisites for DP0.2? YA: Yes
FM: The IDF DP0.1 came up this morning. The products looked like what I would expect based on the DPDD; are there products that have gone through SDM standardization that I can look at?
HC: The data products ingested in qserv should be post-SDM.
YA: The SDM standardization gets run during the monthly reprocessing.
FM: To check, these take FITS files as input and output parquet tables? YA: Yes
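For context, a minimal sketch of what a FITS-in, parquet-out standardization step could look like, assuming astropy and pyarrow are available; the file names and the column mapping are placeholders, and the actual Science Pipelines tasks apply much richer, functor-based transformations.

```python
# Hypothetical illustration only: read a FITS catalog, keep and rename a few
# columns to DPDD-style names, and write the result as a parquet table.
# File names and the column mapping are made up for this sketch.
from astropy.table import Table

COLUMN_MAP = {
    "base_SdssCentroid_x": "x",        # assumed pipeline-internal names
    "base_SdssCentroid_y": "y",
    "base_PsfFlux_instFlux": "psfFlux",
}

def standardize(fits_path: str, parquet_path: str) -> None:
    """Read a FITS binary table, rename mapped columns, write parquet."""
    cat = Table.read(fits_path)
    out = cat[list(COLUMN_MAP)]               # keep only the mapped columns
    for src, dst in COLUMN_MAP.items():
        out.rename_column(src, dst)
    out.to_pandas().to_parquet(parquet_path)  # needs pyarrow (or fastparquet)

standardize("src_catalog.fits", "object_standardized.parquet")
```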
GD: Are the monthly reprocessing runs done with Gen 3 or Gen 2?
YA: The Gen 2 runs are what we look at for feature regressions, but we run Gen 3 as well. Gen 3 is not at feature parity yet, but starting March 1st we will regularly compare both Gen 2 and Gen 3, and once they match we will turn off the Gen 2 reprocessing.
YA: From the October 2019 meeting, the remaining next steps are:
Still need to convert functors to registry
ap_pipe is already DPDD-ified (now called apdb)
GB: In Gen 3, is the unit of work (a quantum) per patch? YA: Yes
CM: How reusable is this?
YA: Only one of the SDM standardization tasks has been converted to Gen 3 (Forced sources)
CM: The final output of AP must be in the right format.
EB: Is AP still blocked?
CM: Too soon to say; the existing Gen 3 task is a good place to start, and we can insert a pipeline step to dump things to parquet. Overall, Gen 3 makes the data handling much easier; we can just use the Butler.
EB: Question for FM: In production we will have a big, long-running apdb; what connection do you see between diasource.yaml and these databases? That is, these YAML files are the single source of truth for our data products, so I would expect our databases to be built from them.
FM: I think that model is more useful for the catalog databases than for the apdb. The apdb has such performance constraints, and Andy is implementing it in Cassandra, so it will need to be hand-tuned.
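As a sketch of the "databases built from the YAML" idea, the following reads a Felis-style schema file and emits CREATE TABLE statements; the input file name, the expected keys, and the type mapping are assumptions for illustration, and the real sdm_schemas/Felis tooling already provides a much more complete translation.

```python
# Hypothetical sketch: derive SQL DDL from a YAML schema description so the
# database is generated from the same file that documents the data products.
# The keys ("tables", "columns", "name", "datatype") and the type map are
# assumptions for this example, not the actual Felis layout.
import yaml

TYPE_MAP = {"long": "BIGINT", "int": "INTEGER", "double": "DOUBLE",
            "float": "FLOAT", "boolean": "BOOLEAN", "char": "TEXT"}

def ddl_from_yaml(path: str) -> str:
    """Build CREATE TABLE statements from a YAML table/column description."""
    with open(path) as f:
        schema = yaml.safe_load(f)
    statements = []
    for table in schema.get("tables", []):
        cols = ",\n  ".join(
            f'{c["name"]} {TYPE_MAP.get(c.get("datatype", "char"), "TEXT")}'
            for c in table.get("columns", [])
        )
        statements.append(f'CREATE TABLE {table["name"]} (\n  {cols}\n);')
    return "\n\n".join(statements)

print(ddl_from_yaml("diasource.yaml"))  # input file name taken from the discussion
```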
GD: All the AP data needs to flow through the apdb; is the special part in Cassandra, then, which columns are indexed? FM: Yes
GD: Do you need to tweak the non-indexed columns in Cassandra? FM: No
FM: Each data product that comes in looks different from the same product a few months before.
GD: Where in AP does the logical equivalent of the standardization happen?
CM: Because we have to interact with the database both spatially and temporally, the standardization happens after the image characterization step, and everything after that maintains the standardized format.
CM: The interface we have now ignores new columns that have been added (e.g. in afw) that are not in the schema.
CM: In terms of data output, we are locked in compared to DRP, because we need to know what the database columns are as soon as the database is fired up.


2. Current status of the AP and DRP data products
Pre-meeting comments: Schemas for DP0: https://github.com/lsst/sdm_schemas


3. Data Facility requirements

FM: I need to keep an eye on the SDM changes to make sure they don't change our sizing model.

FM: We may want to talk with Andy Salnikov to make the apdb more dynamic.

CM: For dax_apdb we can get rid of all of the afw tables, since we are using parquet.

GD: If we go back to the original discussion, I am still puzzled about what we are doing with these other tables. There are three flavors of image tables: tables from the original schema, which are very Rubin-specific and predate any discussion of standards; tables in ObsCore format, based on community standards, and similarly CAOM; and then there is the Butler data registry.

GD: We can get ObsCore from the Gen 3 Butler data registry. The registry has more information in it than that.

YA: It's a good idea to go back to the whiteboard. The Gen 3 registry will not contain everything that was in the "Jacek" data model.

YA: With Gen 3 it has gotten easier to add data products.

CS: The Butler integration will require K-T, Jim, and Tim to make sure the prompt and offline data products are compatible.

HC: We need to be able to verify that a parquet data file actually has the schema it says it has.

HC: What we have done on the qserv side is to add an extra step to make sure that the data conform to SDM, and to force the values when they do not (mostly NaN values).
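A minimal sketch of that kind of conformance step, assuming pandas with a parquet engine installed; the expected column names, dtypes, and fill policy are placeholders, not the actual qserv ingest code.

```python
# Hypothetical sketch: check that a parquet file matches an expected SDM-style
# schema and coerce non-conforming values (bad entries become NaN, as in the
# discussion). Column names, dtypes, and the integer fill value are placeholders.
import pandas as pd

EXPECTED = {"objectId": "int64", "ra": "float64", "dec": "float64",
            "psfFlux": "float64"}

def conform(parquet_path: str) -> pd.DataFrame:
    df = pd.read_parquet(parquet_path)
    missing = set(EXPECTED) - set(df.columns)
    if missing:
        raise ValueError(f"columns missing from file: {sorted(missing)}")
    df = df[list(EXPECTED)].copy()                      # drop anything not in the schema
    for name, dtype in EXPECTED.items():
        col = pd.to_numeric(df[name], errors="coerce")  # unparseable values -> NaN
        if dtype.startswith("int"):
            col = col.fillna(-1)                        # placeholder fill before int cast
        df[name] = col.astype(dtype)
    return df

checked = conform("object_standardized.parquet")
```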

YA: The link from SDM to SQL schema is still very weak.

HC: FELIS is the format we write the parquet tables in.

GD: FELIS can translate between different formats for different telescopes.

YA: Another weak link is between the DPDD and the schema.

GD: The intent before was that there would be some token in the YAML indicating that a particular entry in the YAML was linked to a specific entry in the DPDD.

YA: We have a list of tickets to make changes to columns of the DPDD (DM-22078)

CS: The critical part is keeping the concrete schema aligned with the data products we are actually producing.

4. Operations requirements (Colin Slater)

5. Discussion

GD: We need to talk about the image metadata problem.

GD: It would be nice to cook up something and present it at the next meeting so we have something to talk about.

IS: The immediate action is for Yusra AlSayyad, Colin Slater, and Gregory Dubois-Felsmann to meet and plan how to fix the image metadata.