...
- If SLURM/sacct reports that the job failed, something must be wrong.
- If SLURM reports that the job completed successfully, check whether the output files were written. If any output files are missing, something may be wrong.
- I have some simple scripts at https://github.com/hsinfang/lsst-notes/blob/master/repo-scripts/walkButlerCalexp.py and https://github.com/hsinfang/lsst-notes/blob/master/repo-scripts/walkButlerCalexpwalkButlerCoadd.py
- Counting files on the filesystem is equivalent.
- The numbers of coadd/multiband output files were as in this comment https://jira.lsstcorp.org/browse/DM-14123?focusedCommentId=100481&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-100481
- Grep the logs for keywords such as FATAL and ERROR (in all cases).
- DM-15121 was a known (non-)error, but it seems to have disappeared.
- Ignored all warnings for now; most are already known or have tickets filed, but it's probably time to go through all of them again.
- Can also grep for "Finished processing". This may be pipeline dependent.
- If anything above isn't right, try to reproduce the error.
- If the error is reproducible and is a pipeline issue, file a ticket with the how-to-reproduce and notify Yusra AlSayyad's team.
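The log checks above (grep for FATAL/ERROR, and optionally for "Finished processing") could be scripted roughly as follows. This is only a sketch, not the script actually used: the function names, the `*.log` glob, and the one-directory layout are assumptions made for illustration, and the "Finished processing" marker may need changing per pipeline. Known non-errors would still show up here and need to be filtered by hand.

```python
import re
import sys
from pathlib import Path

# Keywords to flag; whole-word, upper-case match (an assumption about log format).
ERROR_PATTERN = re.compile(r"\b(FATAL|ERROR)\b")
# Completion marker; this may be pipeline dependent.
DONE_MARKER = "Finished processing"


def scan_log(path):
    """Return (error_lines, finished) for a single log file.

    error_lines is a list of (line_number, text) for lines matching
    ERROR_PATTERN; finished is True if DONE_MARKER appears anywhere.
    """
    error_lines = []
    finished = False
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if ERROR_PATTERN.search(line):
                error_lines.append((lineno, line.rstrip("\n")))
            if DONE_MARKER in line:
                finished = True
    return error_lines, finished


def scan_log_dir(log_dir, pattern="*.log"):
    """Scan every file in log_dir matching pattern (hypothetical layout)."""
    return {p.name: scan_log(p) for p in sorted(Path(log_dir).glob(pattern))}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_logs.py /path/to/logs
    for name, (errors, finished) in scan_log_dir(sys.argv[1]).items():
        status = "ok" if finished and not errors else "CHECK"
        print(f"{status}\t{name}\terrors={len(errors)}\tfinished={finished}")
```

Any log marked CHECK is a candidate for the reproduce-and-ticket steps above; a log with no errors but no completion marker is worth a look too, since the job may have been killed before it finished.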
...