This page collects notes on dataset reprocessing with the as-is Gen 2 tools, i.e. pipe_drivers/ctrl_pool.
Detailed explanations of how to set up a Gen 2 Butler repo and run the driver tasks:
As-is status, mitigation procedures if anything goes wrong, and the science payload of the HSC reprocessing are detailed in:
Notes on the HSC PDR-scale campaigns, including a summary of computing resource usage:
2017: S17B HSC PDR1 reprocessing
2018: S18 HSC PDR1 reprocessing
Periodic HSC RC-scale reprocessing:
RC1: Reprocessing of the HSC RC dataset
RC2: Reprocessing of the HSC RC2 dataset
Example command lines to run HSC-RC2 on NCSA Verification Cluster: https://github.com/lsst-dm/gen2-hsc-rc2
Slack channel #dm-hsc-reprocessing
(might be useful; updates may be needed)
ImSim DC2 Run1.2i dataset at /datasets/DC2/repo/. The Princeton team would like at least one tract to be reprocessed every month, but this has not been established yet.
- DM-17421
- DM-17555
Example command lines are in the files attached to DM-21056.
Verification Cluster at NCSA
https://developer.lsst.io/services/lsst-dev.html, including information about the shared software stack at /software/
https://developer.lsst.io/services/verification.html
https://monitor-ncsa.lsst.org/ (e.g. can check memory usage)
Questions go to #dm-infrastructure
An incomplete list of things to check after the jobs run:
- If SLURM/sacct reports that a job failed, something is definitely wrong.
- If SLURM reports that a job completed successfully, check whether the output files were written. If not all of them were written, something may be wrong.
- I have some simple scripts at https://github.com/hsinfang/lsst-notes/blob/master/repo-scripts/walkButlerCalexp.py and https://github.com/hsinfang/lsst-notes/blob/master/repo-scripts/walkButlerCoadd.py
- Counting files on the filesystem works equally well.
- The numbers of coadd/multiband output files were as given in this comment: https://jira.lsstcorp.org/browse/DM-14123?focusedCommentId=100481&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-100481
- In all cases, grep the logs for keywords such as FATAL and ERROR.
- DM-15121 was a known (non-)error, but it seems to have disappeared.
- All warnings have been ignored for now; most are known or have tickets filed, but it is probably time to go through all of them again.
- You can also grep for "Finished processing"; this may be pipeline dependent.
- If anything above isn't right, try to reproduce the error.
- If the error is reproducible and is a pipeline issue, file a ticket with instructions on how to reproduce it and notify Yusra AlSayyad's team.
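A minimal sketch of the SLURM check, assuming Python 3.7+ and that `sacct` is on the PATH. It asks sacct for pipe-separated output (`--parsable2 --noheader --format=JobID,JobName,State,ExitCode`) and flags any job whose state is not clean or whose exit code is not `0:0`. The set of "bad" states and all function names here are assumptions for illustration, not part of any existing tool.

```python
import subprocess

# Hypothetical choice of SLURM states treated as failures; adjust as needed.
BAD_STATES = {"FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}


def parse_sacct(text):
    """Parse `sacct --parsable2 --noheader --format=JobID,JobName,State,ExitCode`.

    Returns a list of (job_id, job_name, state, exit_code) tuples.
    """
    rows = []
    for line in text.strip().splitlines():
        job_id, job_name, state, exit_code = line.split("|")
        rows.append((job_id, job_name, state, exit_code))
    return rows


def failed_jobs(rows):
    """Return the rows whose state is bad or whose exit code is non-zero."""
    bad = []
    for job_id, job_name, state, exit_code in rows:
        # sacct can append text to the state, e.g. "CANCELLED by 12345".
        base_state = state.split()[0] if state else ""
        if base_state in BAD_STATES or exit_code != "0:0":
            bad.append((job_id, job_name, state, exit_code))
    return bad


def query_sacct(job_id):
    """Run sacct for one job ID and parse the result (needs SLURM)."""
    result = subprocess.run(
        ["sacct", "-j", str(job_id), "--parsable2", "--noheader",
         "--format=JobID,JobName,State,ExitCode"],
        capture_output=True, text=True, check=True,
    )
    return parse_sacct(result.stdout)
```

On the cluster, something like `failed_jobs(query_sacct(JOB_ID))` would list the offending job steps; an empty list only means SLURM saw no failure, so the output and log checks are still needed.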
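For the output-file check, counting files on disk can be scripted along these lines. This is a sketch only: the glob patterns depend on the Gen 2 repo layout and are supplied by the caller, the expected counts would come from a reference such as the DM-14123 comment, and the function names are hypothetical.

```python
import glob
import os


def count_outputs(repo_root, patterns):
    """Count files under repo_root matching each glob pattern.

    `patterns` maps a label (e.g. a dataset type) to a glob pattern relative
    to repo_root; "**" is allowed because recursive globbing is enabled.
    """
    counts = {}
    for label, pattern in patterns.items():
        matches = glob.glob(os.path.join(repo_root, pattern), recursive=True)
        counts[label] = sum(1 for path in matches if os.path.isfile(path))
    return counts


def compare_counts(counts, expected):
    """Return {label: (found, expected)} for every label whose count differs."""
    return {
        label: (counts.get(label, 0), n)
        for label, n in expected.items()
        if counts.get(label, 0) != n
    }
```

An empty dictionary from `compare_counts(count_outputs(repo, patterns), expected)` means every counted dataset type matched its expected number.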
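The log checks can be scripted in the same spirit. This sketch counts lines containing FATAL or ERROR as whole words, skips lines containing any caller-supplied substring for known non-errors, and counts "Finished processing" lines. That marker is pipeline dependent, and the function names and the ignore mechanism are assumptions for illustration.

```python
import re
from collections import Counter

# Keywords taken from the checklist on this page.
ERROR_PATTERN = re.compile(r"\b(FATAL|ERROR)\b")
DONE_MARKER = "Finished processing"


def scan_log_lines(lines, ignore=()):
    """Scan log lines for FATAL/ERROR messages and completion markers.

    `ignore` holds substrings of known non-errors; matching lines are skipped.
    Returns (Counter of first keyword found per line, list of offending
    lines, number of lines containing the completion marker).
    """
    counts = Counter()
    offending = []
    finished = 0
    for line in lines:
        if DONE_MARKER in line:
            finished += 1
        match = ERROR_PATTERN.search(line)
        if match and not any(s in line for s in ignore):
            counts[match.group(1)] += 1
            offending.append(line.rstrip("\n"))
    return counts, offending, finished


def scan_log_file(path, ignore=()):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as f:
        return scan_log_lines(f, ignore)
```

Comparing the completion count against the number of inputs submitted gives a quick sanity check alongside the FATAL/ERROR tally.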