This is a rerun of the S20 HSC PDR2 Reprocessing, but with the v24 stack, Gen3, at the USDF.

See the main ticket for history: DM-39132.

PDR2 v24 Error Characterization

Documentation of 2023 HSC PDR2 Campaign: http://rtn-063.lsst.io

DM workshop presentation: Data Facilities mini-workshop - 2023-06-16 (see the link for CM status and experience for steps 1, 2, and 3)

Submit directories:

  • Step1 and step2a: /sdf/data/rubin/user/mccarthy/prod/cm_prod/submit/HSC/runs/PDR2
  • step2b and beyond: /sdf/data/rubin/shared/campaigns/PDR2/cm_prod/submit/HSC/runs/PDR2/v24.1.0_DM-39132

Summary Butler collection: HSC/runs/PDR2/v24_1_0/DM-39132 (contains outputs from steps 1-7 and farotract inclusive, plus the input ancillary collection with skymaps and calibrations). It can be used for analysis plots and other V&V metric generation and evaluation.

Data and Collections

Summary collection for PDR2: HSC/runs/PDR2/v24_1_0/DM-39132. It contains the outputs of steps 3-7, the inputs to step3 (which include the step1 and step2abcde outputs), the farotract outputs, and the ancillary inputs (skymap, calib, and mask collections).

faro on tracts (step8): The subset of 'per-tract' pipetasks from the faro list of pipetasks in DRP-RC2.yaml was run on the PDR2 outputs. The 'per-visit' faro pipetasks were not run.

We split the 710 PDR2 tracts into 90 groups run using CM tools (recent checkout of branch DM-40200): we used the CM 'split_dict' functionality and the separate WIDE, DEEP, and UDEEP tract lists from step3 to automatically generate the 90 groups, with WIDE tracts in groups of 10 and DEEP/UDEEP tracts in groups of 2.

In order to get the quantum graph (qG) and EXEC butler to build in < 24 hours, it was necessary to give this query-planning hint to the quantum graph builder as an option in the bps yaml file: extraQgraphOptions: --dataset-query-constraint objectTable_tract
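For orientation, here is a minimal sketch of where that hint sits in a bps submit yaml; apart from the extraQgraphOptions line quoted above, the values (pipeline path, collection, data query) are placeholders rather than the production settings:

  # Sketch of a bps submit yaml fragment for one faro-tract group.
  # Only extraQgraphOptions is taken from the notes above; the other values are placeholders.
  pipelineYaml: "/path/to/DRP-RC2-faro-tract.yaml#faro_tract"       # hypothetical local pipeline file
  extraQgraphOptions: "--dataset-query-constraint objectTable_tract"
  payload:
    butlerConfig: /repo/main
    inCollection: HSC/runs/PDR2/v24_1_0/DM-39132/step7              # summary inputs, per the step7 notes below
    dataQuery: "skymap = 'hsc_rings_v1' AND tract IN (9615, 9697)"  # hypothetical group of tracts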

In addition, a few pipetasks needed their requestMemory allocations increased (similar to what was done for DP0.2). These tasks were:

  matchCatalogsPatchMultiBand:
    requestMemory: 120000
  matchCatalogsTract:
    requestMemory: 220000
  matchCatalogsTractGxsSNR5to80:
    requestMemory: 220000
  matchCatalogsTractMag17to21p5:
    requestMemory: 220000
  matchCatalogsTractStarsSNR5to80:
    requestMemory: 220000

Even with these increased memory limits, tract 9813 still failed some pipetasks/bands with out-of-memory errors. We didn't pursue this any further.

The new USDF PanDA was used (instead of the legacy CERN PanDA Doma).  The CERN IAM (copy-and-paste a web token) authenticator was used to gain access to the USDF PanDA IDDS page:

https://usdf-panda-bigmon.slac.stanford.edu:8443/idds/wfprogress/

The template that loads a PanDA setup script into the job_000.sh scripts for each launched group was updated to load setup_panda_usdf.sh so that the new USDF PanDA environment variables are picked up. CM tools may need further updates to its 'cm check' routines to poll the IDDS website for job status.

Since the default DRP.yaml pipetask definitions were not split into faro-visit and faro-tract subsets for the v24 stack (Orion has done this for the weeklies with later stacks), we made our own version of DRP-RC2.yaml that defined just the faro_tract steps and referenced that in the bps submit yaml files.
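One way to get that effect (a sketch under assumptions, not the exact file used): import the stock pipeline and define a tract-only subset, then point pipelineYaml in the bps submit yaml at that subset. The file name, import location, and subset contents below are illustrative; the task labels shown are the ones already mentioned on this page.

  # DRP-RC2-faro-tract.yaml -- hypothetical local wrapper pipeline (sketch only)
  description: Tract-level faro metrics for the PDR2 v24 run
  imports:
    - location: $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml   # assumed location of the stock v24 pipeline
  subsets:
    faro_tract:
      description: per-tract faro metric tasks only
      subset:
        - matchCatalogsPatchMultiBand
        - matchCatalogsTract
        - matchCatalogsTractGxsSNR5to80
        - matchCatalogsTractMag17to21p5
        - matchCatalogsTractStarsSNR5to80
        # ... plus the remaining per-tract faro tasks

The bps submit yaml for each group would then reference this file as pipelineYaml: "/path/to/DRP-RC2-faro-tract.yaml#faro_tract".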

group 1 needed a rescue, and groups 2, 15, and 16 were rerun from the start (to recover from an out-of-memory and/or PanDA glitch), since a rescue failed to build a quantum graph with the 'query-planning hint' in place.

group87 contained the two very large UDEEP tracts 9812 and 9813. This was allowed to run, but failed in several places due to out-of-memory, mostly on 9813. group91 was created for tract 9813 alone and rerun with larger memory; however, memory was still exceeded for some metrics. No attempt was made to pursue completing 9813 further.

group 60 was empty (similar to groups 119-122 being empty in step3); this is related to how tract sets are defined as 'supersets' of all PDR2 exposures and is currently expected.

group90 was created to rerun tract 9812, but it was not run, as group 87 covered it.

Most metric calculators failed on the narrow-band (N387, N816, N921) observations. No attempt was made to recover these.

Not every metric was calculated for every band, but most were calculated for the (g,r,i,z,y) data in the 710 tracts.

A collection was made to gather all farotract outputs and can be used for further analysis: HSC/runs/PDR2/v24_1_0/DM-39132/farotract

step7 summary collection: This step was run following step3; however, once steps 4, 5, and 6 were done, the output collections from steps 4, 5, and 6 plus the ancillary input collection (with skymap, etc.) were added to the step7 collection. Therefore the final output collection for PDR2, with everything in it, is HSC/runs/PDR2/v24_1_0/DM-39132/step7, and this collection may be used for further V&V plots as well as input to faro/step8 and other validation scripts. The final PDR2 collection was generated on Aug 30, 2023.

step6: This step had 21 groups of about 720 visits each. It ran only one task, consolidateDiaSourceTable, and ran without any errors: about 10 minutes to make each qG and less than 30 minutes to process each group.

step5 (DM-40536): This step had 94 tract-based groups, with about 7 tracts in each group. As WIDE and UDEEP tracts were treated equally, one group (group 50) contained several UDEEP tracts (9812, 9813, 9814+), and for this group only, requestMemory needed to be increased on consolidateFullDiaObjectTable (from 10GB to 100GB), transformForcedSourceOnDiaObjectTable (from 55GB to 255GB), transformForcedSourceTable (from 7GB to 70GB), and consolidateAssocDiaSourceTable (from 5GB to 50GB). It was confirmed that all 710 tracts were processed and had outputs. Note that 3 groups (0, 80, and 81) had no quantum graph generated due to no overlap between the 'predicted tract list' and the 'actual tract list'; this behavior was expected (and was seen in groups 119-122 of step3 as well). When the CM 'collect' process was run, a hand-edit was needed to remove these three collections from the collect_job_000.sh script.
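As a sketch, the group-50 overrides quoted above correspond to bps per-task settings like the following (requestMemory values in MB; how CM scoped them to group 50 only is not reproduced here):

  # Per-task memory overrides applied only to the group containing the UDEEP tracts.
  pipetask:
    consolidateFullDiaObjectTable:
      requestMemory: 100000
    transformForcedSourceOnDiaObjectTable:
      requestMemory: 255000
    transformForcedSourceTable:
      requestMemory: 70000
    consolidateAssocDiaSourceTable:
      requestMemory: 50000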

step4 (DM-40350): This step had 41 visit-based groups, with about 350 visits per group. Two errors kept recurring, both involving the PSF; similar errors had been seen during step1. Memory was increased for a couple of pipetasks after testing.

step4 prepare: step4 works on a per-visit basis to perform the first part of the DIA (Difference Image Analysis). The 14363 visits surviving through step3 (after the visit veto removed about 2.6K visits for poor quality) are divided into 40 groups of about 400 visits each. A special extraQgraph option is specified to improve quantum graph query-plan performance. The qGraph and EXEC butler are created in about 1.3 hours per group. A group of 400 visits runs in about 5 hours on 2000 cores, typically in the 4GB-8GB queue (we believe very few quanta exceed 4GB, so one could experiment with running in the < 4GB memory queue). We estimate all of step4 will take about 1 week wallclock to finish once the go-ahead to start it is given.

step7 preliminary run: Using the fixeups branch of healsparse in the lsstdesc GitHub repo as a custom_lsst_setup with CM, the PDR2 HSC full-footprint maps were calculated. The collection containing these maps is HSC/runs/PDR2/v24_1_0/DM-39132/step7. There are 8 sets of maps, one for each band: (g,r,i,z,y,N387,N816,N921); the narrow bands have relatively sparse coverage.

step3 healSparsePropertyMaps: Using the fixeups branch of healsparse in the lsstdesc GitHub repo as a custom_lsst_setup with CM, 19 tracts were processed to complete per-tract, per-band healSparsePropertyMap generation.

step3 cleanup: The step3 output collection has been updated to include the reruns of 17 tracts which were previously missing patches, including tracts 9812 and 9813 in i, z band and some other UDEEP and WIDE tracts. These updated collections have been prepended to the step3 output collection, so if one selects with --find-first, it should find only the updated patch information. The list of tracts is described here:

We've completed the (re)running of the 13 dynamic sky tracts (8766, 8985, 9089, 9090, 9571, 9812, 9813, 9981, 10192, 15808, 15815, 16984, 17272) with meas_algorithm updates, the additional tract 9076, which was found to have had a 'production system glitch which caused detection to fail', and 3 tracts (9225, 9604, 10659) which didn't produce objectTable_tract outputs due to a thresholding failure in coaddition.


These 17 tract output collections were prepended to the step3 output run:

HSC/runs/PDR2/v24_1_0/DM-39132/step3

This run now has all 710 objectTable_tract entries, which can be listed with a command like this:

butler query-datasets /repo/main --find-first --collections HSC/runs/PDR2/v24_1_0/DM-39132/step3 objectTable_tract > ~/otts

Please note the --find-first switch included here. If one leaves this switch out, then 14 tracts are included twice (the 13 from the dynamic sky rerun list above plus 9076). Since the latest runs of detection and measurement were prepended to the full step3 output collection, using --find-first returns the correct version of the outputs for these 14 tracts.

If quality plots have already been made for the above 17 tracts, they could be remade at this time with the updated outputs, perhaps after more checking of step3 is confirmed.

Note that the skymap and other calibration info is not included in the large step3 output collection; rather, it remains in a separate collection: HSC/runs/PDR2/v24_1_0/DM-39132/ancillary

One could make a short two-entry chained collection with the step3 output and the ancillary collection together if that makes life easier.

We have not done anything more with the 3 patches noted in the error listings as 'assemble coadd exited early -- no temporary patches found'.

The healsparsePropertyMaps have also not been updated for the ~25 failures where there was nothing there.

step3 Q/A checks: Step 3 ran through July 22, 2023 (approximately 3 weeks wallclock after starting). The output tagged collection is HSC/runs/PDR2/v24_1_0/DM-39132/step3, with 707 tracts complete. More details are in the DM-39815 JIRA ticket and the error characterization pages. Memory needed to be increased for at least some quanta of several pipetasks, and runtime limits also had to be increased in at least one case. More details to be added here. It was noted that for about 0.5% of patches (1K/180K), no object detections were made because not enough good dynamic sky measurements were available. This may lead to the need to rerun these patches. Analysis continues.

step3 start (DM-39815): Step 3 is scheduled to start July 5, 2023. Step 3 is a 'coadd' step where calexps (calibrated visits) that overlap the same tract on the sky are combined together. There are 17315 calexps in the WIDE, DEEP, and UDEEP depths, covering (partially or completely) 739 (approximately 2 square-degree) tracts on the sky. The UDEEP tracts contain up to 1800+ calexps to be coadded in 5 bands (grizy), including, for example, 426 z-band calexps overlapping tract 9813. The 739 tracts covered in the current PDR2 outputs agree (1-1, we think) with the list of 39 DEEP+UDEEP and 700 WIDE tracts done in 2020 (S20 HSC PDR2 Reprocessing). It is expected that WIDE tracts will take 24 hours to process and UDEEP tracts may take up to 48 hours. It is suggested to start with the UDEEP and DEEP tracts, perhaps putting only one tract in a group; the 700 WIDE tracts could then be grouped into roughly 70 groups of 10 tracts each, depending on memory and wallclock timings.

step2: DM-39392

step2e: This is a global step which gathers all ccd/visit and visit records into summary tables. The outputs of step2e are not directly used by step3. Due to a known 'PSF' expansion error when reading in inputs, this step required a large increase in requestMemory for its two pipetasks. Ultimately 480GB of RAM (close to the limit of 512GB available per node) was used, and the processing took about 7 hours wallclock to generate the output tables.

step2d: Applies the calibrations from step2c to the visit/source catalogs of step2a. This step must be carefully clustered, since 2 of the 5 pipetasks operate on a per-detector basis. These pipetasks (writeRecalibratedSourceTable, transformSourceTable) can be 'clustered' on their own on a per-visit basis so that all 103 detectors are processed as one cluster (one job in PanDA). This feeds 103 long UUIDs to PanDA and approaches, but does not quite exceed, the 4000-character limit for a PanDA input string setting. This works for HSC, but will not work for 189-detector LSSTCam clustering, so the 4000-character issue should be addressed before then. An attempt to divide the 103 detectors into two sets (d < 52 and d >= 52) failed the quantum graph build or the running in other areas/steps. Memory also needed to be increased considerably for several tasks. See the requestMemory and clustering_step2d_both yamls in https://github.com/lsst-dm/cm_prod/tree/tickets/DM-39392/src/lsst/cm/prod/configs/PDR2 for more hints. It eventually took about 3 days wallclock to get through step2d successfully.
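A sketch of the kind of clustering stanza this describes, using bps dimension clustering (the cluster name is invented; the two task labels are from the text above):

  # Cluster the two per-detector tasks by visit so that all 103 detector quanta form one PanDA job.
  clusterAlgorithm: lsst.ctrl.bps.quantum_clustering_funcs.dimension_clustering
  cluster:
    sourceTableVisit:                 # hypothetical cluster name
      pipetasks: writeRecalibratedSourceTable, transformSourceTable
      dimensions: visit

Clustering by visit is what produces the long list of quantum UUIDs in the PanDA input string mentioned above.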

step2c: fgcm photometric calibration. A 'global' step, as it simultaneously fits selected stars from all tracts/bands together to obtain a uniform solution across the whole dataset footprint. Three pipetasks: BuildTable (10 hours wallclock, 1 core, 350GB RAM), fitCycle (16 hours, 12-32 cores, 350-416GB), and OutputTables (0.5 hours). Note that future versions of BuildTable will run in parallel, so the 10-hour wallclock will be reduced. fitCycle will need some modifications to handle datasets larger than PDR2 (17K exposures, 739 tracts) within the 512GB/node computing hardware available. Modifications were made to the PanDA system to enable multi-core usage (for the first time); a hedged bps sketch of the multi-core request follows the notes below.

  • Note that the FGCM lookup table needed for the full PDR2 processing is in the collection "HSC/fgcmcal/lut/PDR2/DM-39549".  This replaces the RC2 FGCM lookup table.
  • Note that the current fgcm code drops Q/A plots into the worker-node area, but these are not currently saved in a butler structure, so they will generally disappear upon job completion. This should be addressed in future code updates.
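For illustration, a minimal bps sketch of the multi-core fitCycle request is below; the fgcmFitCycle label and the exact figures are assumptions based on the timings quoted above, not the recorded production values:

  # Multi-core request for the fgcm fit-cycle task (illustrative values; requestMemory in MB).
  pipetask:
    fgcmFitCycle:                     # assumed task label for 'fitCycle'
      requestCpus: 32
      requestMemory: 416000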

step2b: Runs the (older) jointcal astrometric calibration on the calexp outputs of step1.

Processing hit out-of-memory errors, which could be alleviated in the future by grouping by band and tract range. It ultimately took about 1 week wallclock for 17315 exposures over 739 tracts.

  • inputs (to use for reproducing) 

    • HSC/raw/PDR2/v24_1_0/DM-39132/input, HSC/runs/PDR2/v24_1_0/DM-39132/ancillary
      Note that HSC/raw/PDR2/v24_1_0/DM-39132/input is a chain of the outputs of step1 and step2a (WIDE/DEEP/UDEEP)
      and includes HSC/raw/PDR2/(WIDE/DEEP/UDEEP)
  • RUN outputs are formatted like: 

    • HSC/runs/PDR2/v24.1.0_DM-39132/step2b/group0/w00_000

    • Note that running rescue workflows would result in names like w01_000
    • Retrying bps submission of the same workflow (say because of lost connection to panda) would result in names like w00_001
  • Once the step completes, the RUN collections will be collected into this CHAINED collection:

    • HSC/runs/PDR2/v24.1.0_DM-39132/step2b
      This CHAINED collection will include HSC/raw/PDR2/v24_1_0/DM-39132/input (but not HSC/runs/PDR2/v24_1_0/DM-39132/ancillary)

step2a: Gathers step1 sources (per-detector catalogs) into per-visit catalogs. There are 103 science detectors per visit (0:8, 10:103); detector 9 is excluded due to the quality of that chip. 2 days wallclock end-to-end, no errors.

Step1 (DM-39133): Uses stack v24.1.0.rc2 on three HSC PDR2 exposure (visit) collections:

WIDE: 14473 exposures (note: exposure 428 (y-band) from the 2020 processing was accidentally left out).

DEEP: 1832 exposures

UDEEP: 1010 exposures (up to 426 exposures in one band covering a single tract on the sky: tract 9813, z-band)

Errors encountered: PDR2 v24 Error Characterization

  • 17315 exposures were divided into 39 groups for processing with the PanDA/Slurm workload/workflow/batch system on 3,000 cores at USDF.
  • It typically took 2.5 hours to generate the quantum graph + EXEC butler per group.
  • Run time was typically 2 hours (not counting the long tail), once clustering of the 5 step1 pipetasks was in place.
  • Long tails, timeouts, and hangs extended the run time to a couple of weeks wallclock for all of step1.
  • Processing ran within a 4GB/core requestMemory allocation.






