• Do we need any changes to the way we handle unit tests?

    • How is our test coverage right now?
    • There exist many JIRA tickets for improving unit tests, from work to allow automatic coverage reporting (e.g. DM-11725) to improvements for specific packages (e.g. DM-9212, DM-10792, DM-79, DM-14295).
    • Packages in v14 without "tests/": pipe_drivers, ctrl_pool, verify_metrics, display_ds9
    • Broken code in the Stack: code in package/examples is not tested, and not every executable in package/bin.src/ is tested. Do we care that some of these are not in working condition or have no docs? Should we work all of them into QC or Jenkins?
    • Will we test Jupyter notebooks? (Promising technology does exist: DM-13064.) One possible approach is sketched after this list.
    • Some of our Stack unit tests are conflated with integration-ish tests. Do we need to reorganize some unit tests into integration tests?
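
One possible approach to testing notebooks (the promising technology question above) is simply to execute them programmatically inside a unit test. The sketch below is only illustrative: it assumes nbformat/nbconvert are available in the test environment and that the notebook's own dependencies are installed; the notebook path is hypothetical.

    # Minimal sketch: execute a notebook end-to-end and fail if any cell raises.
    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor

    def test_notebook_runs(path="examples/demo.ipynb"):
        with open(path) as f:
            nb = nbformat.read(f, as_version=4)
        ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
        # Raises CellExecutionError (failing the test) if any cell errors.
        ep.preprocess(nb, {"metadata": {"path": "."}})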

JDS comments

Thanks Hsin-Fang Chiang for the notes above. A few thoughts of my own:

  • I don't think we can sensibly imagine a wholesale rethinking of test standards in the scope of this WG.
  • I do think that everything we distribute should be tested. That includes executables, examples, and notebooks.
    • There's a technology question here, of course, in that we don't currently have an easy way of testing notebooks.
    • The notebook question also relates to documentation: do we actually want to distribute notebooks at all? If Jonathan Sick proposes an alternative documentation mechanism, that might sidestep the question entirely.
    • I think it would be fair to require that all executables, examples, and documentation containing code should be tested, and for the WG to request that the technology to make that possible be prioritised.
  • I don't think we can usefully put a requirement on coverage on non-(executable, example, doc) code.
    • In particular, I don't think it's practical to schedule lots of effort to increase coverage on old code.
    • However, we should be tracking coverage, and perhaps expecting reviewers to verify that new code does not decrease coverage when it's added (a sketch of such a check follows this list).
  • ...in fact, an idea based on the above is that we should provide an explicit list of things that reviewers are expected to check for.
  • We should document explicitly how we expect broken examples, etc, to be handled when they are discovered.
    • My inclination is to remove them (simply git rm) and file a ticket to fix them up someday, rather than trying to block ongoing work.
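
To make the coverage expectation above concrete, a CI job (or a reviewer) could compare measured coverage against a recorded baseline. The sketch below uses coverage.py driven from Python; the package name and baseline figure are hypothetical, and in practice the tooling from DM-11725 (or pytest-cov in Jenkins) would be the more likely mechanism.

    # Illustrative coverage gate: fail if total coverage drops below a baseline.
    import coverage
    import pytest

    BASELINE_PERCENT = 80.0  # hypothetical previously recorded value

    cov = coverage.Coverage(source=["lsst.example_pkg"])  # hypothetical package
    cov.start()
    pytest.main(["tests/"])        # run the package's unit tests under coverage
    cov.stop()
    cov.save()

    percent = cov.report()         # prints a report and returns the total percentage
    if percent < BASELINE_PERCENT:
        raise SystemExit(f"Coverage fell to {percent:.1f}% (baseline {BASELINE_PERCENT}%)")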


  • How are datasets made available to developers? Git LFS repositories?

    • Git LFS is capable of storing 100 GB objects, and the AWS S3 limit is 5 TB. Some of our largest existing Git LFS repositories are ap_verify_hits2015 (~204 GB) and validation_data_hsc (~618 GB).
    • Small test datasets → Git LFS. Large test datasets → /datasets GPFS space.
    • Regarding /datasets, the current policy is at https://developer.lsst.io/services/datasets.html. Usability varies between repositories; not all repositories have a person or team actively checking and maintaining their usability. A few difficulties include:
      • (1) some obs packages are in an early phase of development, so their respective data repositories are also experimental;
      • (2) in the Gen 1/2 Butler, repositories have intrinsic versions tied to a Stack version, so a once-working repository can become outdated as the Stack changes;
      • (3) it's not always clear whether data added by ex-team members still have use cases;
      • (4) multiple RFCs related to /datasets have been adopted but not implemented;
      • (5) maintaining usability requires coordinated effort from multiple teams, and according to the current policy, "responsibility for maintaining usable datasets is a DM-wide effort";
      • (6) historically this has not had high priority, especially since switching to the Gen 3 Butler will change everything;
      • (7) the Obs-WG may change everything too.
      • Existing repositories on /datasets  
        • auxTel, comCam, ctio0m9, lsstCam :
          These four belong to Merlin and the CPP team. I'm not sure whether a non-CPP developer would be able to use these repos easily; this is my concern (1) above.
        • gapon : For qserv use.
        • sdss : Partially ('dr7') used by qserv work; partially ('dr9') unclear. The 'dr7' folder contains some old butler repositories organized differently from the current policy (RFC-249); I wonder if that is because those data predate the "rerun" feature in the Stack.
        • refcats : Not sure whose responsibility this is, but it has been useful since it was provided. It’s a special butler folder, and is sym-linked into other regular camera butler repositories.
        • hsc : Maintained by DRP+LDF teams; regularly used in HSC reprocessing.
        • decam : Unclear use case and responsibility. Concerns (2) and (3) happened here. One solution is to delete all data here unless a use case is provided.
        • des_sn : I assume this was meant to be temporary, and will be merged into the decam repo somehow?
    • Similar maintenance concerns exist with the Git LFS datasets too.  
    • On github.com/lsst/ we have: afwdata, testdata_cfht, testdata_jointcal, testdata_decam, testdata_subaru, testdata_deblender, testdata_lsstSim, qserv_testdata, ap_verify_hits2015, validation_data_decam, validation_data_hsc, validation_data_cfht. Are they all eups packages? Some are required in lsst_ci. Some are not CI-ed with the Stack. Some are not versioned with the Stack. How user-friendly are these packages to developers?
    • What are the roles of validation_data_* for the raw and the processed data? (Relevant tickets: e.g. DM-5147, DM-5381.) Manual updates are required for the processed data (e.g. DM-13204). How frequently do developers want them updated? Other maintenance, such as DM-13376, is needed occasionally.
  • Only on the verification cluster? Where does a developer who just wants “some data” go? (This covers how datasets are managed, not what the contents of those datasets should be).

    • /datasets is readable via the JupyterLab environment.
    • LDF can offer public web access, as long as there are no data rights issues. However, I'm not sure how useful that is to developers with the Butler as-is. Gen 3 will make such access more useful, as required by DMS-MWBT-REQ-0040 (Remote Input DataRepository) in the DM Middleware Requirements (LDM-556). Eventually, there will be the LSP Web API Aspect.
    • In the Gen 1/2 Butler, getting partial data from a remote data repository onto a developer's laptop is not trivial; a how-to example is at this comment. This will be easier in Gen 3: relevant features are in requirements DMS-MWBT-REQ-0010 and DMS-MWBT-REQ-0011 (Subsetting a DataRepository), and DMS-MWBT-REQ-0016 (intermediate outputs of Data Release Production [test] processing shall be usable as inputs for test/development processing on external hardware).
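
For illustration only, the kind of remote, subsetted access envisaged by the requirements above might look roughly like the sketch below. The Gen 3 daf_butler interface is not finalized, and the repository URI, collection, and data ID shown here are hypothetical.

    # Hedged sketch of a developer fetching one dataset from a remote Gen 3 repo.
    from lsst.daf.butler import Butler

    # Point a local client at a (hypothetical) remote data repository.
    butler = Butler("https://lsst-ldf.example.org/repo", collections=["HSC/runs/RC2"])

    # Fetch only what is needed, e.g. a single calibrated exposure.
    calexp = butler.get("calexp", instrument="HSC", visit=1228, detector=40)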

JDS Comments

Given all the above, one might question whether we need to provide data through Git LFS at all.

Or perhaps more concretely: there seems to be a clear requirement for datasets needed to run unit tests to be available through LFS. (So that's afwdata, maybe testdata_foo). But is there any advantage to having validation_data_bar packaged in LFS? Having a developer simply able to ask their Butler to fetch data they need from the LDF would be convenient, and would mean we don't need to curate datasets which live in more than one location.

Reading further, I guess I'm wondering if the “small” datasets in the table below should simply live on the VC rather than in LFS.


Per discussion of 2018-05-25:

  • We like having a per-dataset owner.
  • But we worry that the fact that we keep all data for a camera in a given dataset makes this hard.

After the Obs Pkg WG has done its thing, we will “never“ (ish) want to change obs packages; we don't expect that there needs to be a lot of churn in these datasets.

Do we need to reprocess processed data in response to stack changes?

  • Only HSC has processed data.
  • And we probably only care about the most recent version of that.

We wonder about how ComCam/AuxTel/LsstCam data will be managed — will it be on the DBB or in datasets? How different will they be in practice? Hsin-Fang Chiang will chat to Michelles about how they see this.

  • Action on John Swinbank — look at the existing git LFS datasets, see if the AP verify dataset model applies to them, if we can define a general structure.
  • John Swinbank to talk to Michael about processed data in datasets.


HFC Comments

I think there are advantages to having at least some of the "small" datasets in Git LFS. ci_hsc has proven to be very useful and it'd be sad to see it go without a better replacement. Even with a replacement, the input dataset may still make more sense living in Git LFS than only on the VC. ci_hsc is also a great dataset for tutorials, so having it public is nice.

So, to me it comes down to:

  • Because Jenkins uses it. (p.s. any mention of Jenkins on the page only means Jenkins as in how DM deploys it today without large infrastructure/architecture changes.)
  • Because we want them public. 

Fetching from the LDF via the Butler will eventually be supported for LSST data, but not in the short term, and I'm not sure whether we'll do so for non-LSST data or without authorization.

I also realized the "small" dataset definition probably spreads too wide: most repositories are actually below 10 GB (ci_hsc 8.2 GB, validation_data_cfht 4.2 GB, validation_data_decam 5.9 GB), so they are a fine size for LFS. Maybe another threshold is whether Jenkins can handle a dataset comfortably (and is that a matter of CPU time, parallelization, or needing more intelligence to pick the "right" subset of data?). ci_hsc seems to be around Jenkins's upper bound?



  • What's an appropriate cadence for small/medium/large scale test runs?

    • Let's define small/medium/large first (smile) → see the table below
    • Not answering the question directly: the as-is service from LDF reprocesses HSC-RC2 once every two weeks, and HSC-PDR1 on request from the DRP team.
    • Regarding cadence: besides the resources needed for each run, we should also consider how fast the QA team can digest the processed data.

JDS Comments

Should the canonical set of metrics being tracked in SQuaSH and used for system verification be based on the reruns at the LDF, rather than on validation data packages?

What are the “canonical set of metrics”? We should return to that below.

HFC Comments

Re: question 1, work in progress is at DM-14328, although I'm not convinced we really need such a large scale for most metrics tracking. Maybe metrics tracking can happen at multiple scales too, following the datasets?



  • integration tests and reprocessing of known data?

    • Currently, automatic end-to-end integration tests of the Science Pipelines include (1) ci_hsc, (2) validate_drp, and (3) ci_ctio0m9. (Anything else automatic in this coverage or beyond?)
    • ci_hsc and validate_drp are run in Jenkins, triggered by timers every night 
      • lsst_distrib contains validate_drp, and is run by timer nightly  (Jenkins/science-pipelines/lsst_distrib)
      • However, building validate_drp itself does not run any processing; the package contains scripts that do the processing.
      • ci_hsc is run by timer nightly (jenkins/science-pipelines/ci_hsc)
      • ci_ctio0m9 is not run automatically, but is run as part of  the "DEMO” in the developer-triggered Jenkins (jenkins/stack-os-matrix/).
      • lsst_ci runs two scripts from validate_drp: runCfhtQuickTest.sh and runDecamQuickTest.sh
      • Why isn't validation_data_hsc in lsst_ci, while validation_data_cfht and validation_data_decam are?
      • "lsst_ci" contains validate_drp and ci_ctio0m9, but not ci_hsc. lsst_ci is provided as a Jenkins build target, and is appended to the package list unless SKIP_DEMO is checked. We also have "lsst_qa", but it's not really in use (?).
      • The big question of meta-packages is probably beyond the WG's scope; see RFC-305 and links therein. Now that we have a Release Manager, can we ask the DMLT to reopen the issue?
      • On the developer side, there seems to be a need for clearer documentation of our Jenkins status. A frequent question from developers about the developer-triggered Jenkins (jenkins/stack-os-matrix/) is "what does 'Skip Demo' actually do?"
    • validate_drp runs pipelines up to processCcd with astrometry_net, using CFHT and DECam data; it hasn't gone beyond pipelines requiring a skymap. (Is this correct?) (From DM-11501 and SQuaSH, the DECam tests seem to no longer run? How can a developer know what's run?) (Should it run HSC data too? What is validation_data_hsc for?) A sketch of the kind of command these tests wrap follows this list.
    • ci_ctio0m9 runs up to processCcd too.
    • ci_hsc includes more pipeline steps, essentially all DRP steps but without meas_mosaic and pipe_analysis. 
    • We need to improve the coverage of integration tests. I think at least all pipelines used in production-ish reprocessing (e.g. the biweekly HSC-RC2) should be CI-ed. Code that is run in the biweekly HSC-RC2 but not CI-ed includes: meas_mosaic, pipe_analysis, and validate_drp (in part).
    • AP team has a proposal to include AP pipelines into CI: DM-13970
    • For the CI jobs run by timer, how is the team notified if anything breaks? And how is the team notified if a metric falls outside its criteria?

      • Breakage notifications: Developers subscribe to Slack channels #dmj-s_lsst_distrib  #dmj-s_ci_hsc  etc
    • Do we need to define product owners, similar to test datasets? 
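
For reference, the validate_drp quick tests essentially wrap Gen 2 command-line task invocations such as the one sketched below (runCfhtQuickTest.sh and runDecamQuickTest.sh are the real entry points). The repository path, visit, and ccd values here are hypothetical.

    # Hedged sketch of the kind of Gen 2 command the quick-test scripts run:
    # processCcd.py over a single CCD of a validation_data_* repository.
    import subprocess

    subprocess.run(
        [
            "processCcd.py", "validation_data_cfht/data",  # input butler repo (illustrative path)
            "--id", "visit=123456", "ccd=12",              # hypothetical data ID
            "--output", "processCcd_out",                  # output repository
        ],
        check=True,
    )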

JDS Comments

When does SQuaSH send alerts on metrics moving in the wrong direction? Can individual users subscribe to alerts on a particular combination of dataset/metric?

How scalable is SQuaSH? Is there a downside to using it for datasets on the scale of HSC RC2? PDR1? The whole LSST survey? (smile)

We should turn the list of what's getting run now (above) into a list of what needs to get run when!

We agreed that the new Release Manager is looking at meta-packages, but we also identify this as a key link to Jenkins jobs.

We probably want to eliminate ci_hsc and fold its functionality into an expanded validate_drp (ie, that goes through coadd processing etc).

I (John Swinbank) want to think more if I'm happy with Slack notifications of Jenkins failures. Hsin-Fang Chiang is probably happier than I am.

Some sort of “smoke test” that demonstrates that things like meas_mosaic and pipe_analysis can even run would be useful.
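
Such a smoke test could be as simple as checking that the packages import and that their command-line entry points start up. The sketch below is illustrative only; the executable name mosaic.py is an assumption and would need to be checked against the actual bin.src contents.

    # Minimal smoke test: verifies only that things can run, not that results are correct.
    import subprocess
    import unittest

    class SmokeTest(unittest.TestCase):
        def test_imports(self):
            import lsst.meas.mosaic    # noqa: F401
            import lsst.pipe.analysis  # noqa: F401

        def test_executable_starts(self):
            # Assumed executable name; replace with the real bin.src script.
            subprocess.run(["mosaic.py", "--help"], check=True,
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    if __name__ == "__main__":
        unittest.main()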


  • How is the system for tracking verification metrics (“KPMs”, if you must) managed? (Not in the sense of what SQuaSH does, but who is running the jobs to calculate verification metrics? How often? etc)

    • Jenkins-run validate_drp jobs run nightly, and the resulting metrics are then published to SQuaSH.
    • Work to publish metrics of the biweekly HSC-RC2 runs into SQuaSH is in progress: DM-14328
    • Task metadata, stored as output files, includes computing metrics such as timing and memory information at the task level. Does SQuaSH already extract it?
      • Tasks record these via the Python decorator pipe.base.timeMethod, which uses resource.getrusage (code link to pipe_base); see the sketch after this list.
    • Specific pipeline code can enable or disable the writing of task metadata files. For example, the biweekly HSC-RC2 reprocessing is currently based on pipe_drivers and only partial task metadata files are written; see DM-12932. Some problems related to task metadata have been reported, e.g. DM-11175, DM-4927.
    • At the job level: as ctrl_pool + Slurm are currently used for the reprocessing, some computing metrics are stored in the Slurm database. Such computing metrics can be retrieved afterwards to compile reports such as Node Utilization for HSC-RC2 Reprocessing Jobs.
    • As we move towards the Gen 3 middleware and a production system, we should have production-framework-level records of such metrics; DMTR-51 Table 3 shows an example.
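
As a concrete illustration of the task metadata mechanism above, the sketch below decorates a task method with timeMethod so that timing and resource.getrusage numbers land in the task's metadata. It assumes the Gen 2 pipe_base API; the task and config names are made up, and the exact metadata key names may vary between Stack versions.

    import lsst.pex.config as pexConfig
    import lsst.pipe.base as pipeBase

    class ExampleConfig(pexConfig.Config):
        pass

    class ExampleTask(pipeBase.Task):
        ConfigClass = ExampleConfig
        _DefaultName = "example"

        @pipeBase.timeMethod
        def run(self, n=1000000):
            # The decorator records wall-clock/CPU time and maximum resident set
            # size (from resource.getrusage) into self.metadata, under keys such
            # as "runStartCpuTime" and "runEndMaxResidentSetSize".
            return pipeBase.Struct(total=sum(range(n)))

    task = ExampleTask()
    task.run()
    print(task.metadata.names())  # inspect the recorded metadata in memory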

  • How should we monitor run-time performance?

    • Task-level resource usage: see the third point about task metadata in the previous section.
    • On a higher level, the "Batch Compute Systems Summary" dashboard at https://monitor-ncsa.lsst.org lets one check the compute node load, used memory, network traffic, etc., on any worker node. LDF is planning to improve this monitoring; if developers have specific metrics or suggestions in mind, that will help LDF prioritize the deployment schedule.

JDS comments

I think I'd like to see this integrated with SQuaSH & CI, rather than being a separate system. When somebody makes a change, they should quickly be able to see if it has a running time impact, and we should issue alerts if running time is increasing.

This may mean running CI jobs on a controlled platform rather than VMs, I guess, but maybe other sources of uncertainty will dominate over that anyway.


Maybe just running ci_hsc in Jenkins is the best way to give developers fast feedback on the performance impact of algorithmic changes.


HFC Comments

Do you mean integrating SQuaSH & CI so that we know the computing-resource impacts of a developer's ticket branch (not just master)? Or are you thinking of some kind of consolidation between Grafana and SQuaSH?


  • Do we need to define additional datasets to represent a wider range of data quality and observing conditions? 
    • In the DM-11345 description, there is a list of known data characteristics. Do we want to define test datasets for those challenging cases and run them regularly?
    • We note the SST provided list of datasets: http://ls.st/9mk



  


Scale | Examples | Size (raw) | Storage | Reprocessing time of one run | Reprocessing cadence (as-is)
tiny | afwdata, testdata_x | < 10 GB | Fits comfortably inside a Git LFS repository | < 1 hour on a single CPU | Used in unit tests; CI required pre-merge
small (too broad?) | ci_hsc, validation_data_x, ap_verify_hits2015 | 10-999 GB | Fine in a Git LFS repository, but transferring can take hours | 1-10 hours on 1 core | CI every night
medium | HSC-RC2 (432 visits) | ~10 TB (processed) | GPFS on LSST machines | 100-1000 node-hours on lsstvc | every 2 weeks
large | HSC-PDR1 (5654 visits) | >100 TB (processed) | GPFS on LSST machines | >1000 node-hours on lsstvc | annually



Discussion with Simon, 2018-05-25

  • Do we need a different technology to run small datasets than to run large datasets?
    • If we have two ways, then we should run all datasets in both ways.
    • Currently, we use SCons (ci_hsc) & shell scripts (validate_foo).
    • validate_drp scripts may be close to the "right thing", but need testing, etc.
    • It should be possible (with a little work) to use the validate framework to drive SLURM (see the sketch after this list).
  • Why does validate_drp currently only use processCcd.py?
    • Just because nobody has time to take it further.
    • Nobody has implemented computation of metrics which are based on further processing.
  • Datasets that we currently have are effectively arbitrary.
    • Nobody has thought about which datasets are optimal for testing what.
  • Can we assume that SuperTask / next gen middleware does “everything”?
    • We're not sure, but it might.
  • Why do we have processed data in validation data packages?
    • We'd have to ask Michael.
    • But probably so we can run validate_drp without having to reprocess the data every time.
    • We should touch base with Michael to see what he thinks about this now.
    • We could also automate this by using Jenkins (...or even cron)
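
On the point above about driving SLURM from the validate framework, one low-effort approach would be to wrap the existing validate_drp scripts in a batch submission. The sketch below is purely illustrative; the script location, partition name, and resource requests are hypothetical.

    # Hedged sketch: submit an existing validate_drp quick-test script as a Slurm job.
    import subprocess
    import textwrap

    def submit_validate_job(script="runCfhtQuickTest.sh", partition="normal", hours=10):
        batch = textwrap.dedent(f"""\
            #!/bin/bash
            #SBATCH --partition={partition}
            #SBATCH --time={hours}:00:00
            #SBATCH --ntasks=1
            bash "$VALIDATE_DRP_DIR/examples/{script}"
        """)
        # sbatch reads the job script from stdin when no file argument is given.
        subprocess.run(["sbatch"], input=batch, text=True, check=True)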
     