
...

Low-level processing details

This section includes low-level details that may only be of interest to the operations team.

The first singleFrame job that contributed to the output data products started on April 20; the last multiband job finished on May 5.

...

Each of the grouped singleFrameDriver jobs used multiple nodes. A sizable fraction (~30%) of jobs failed because slurm could not launch them: singleFrameDriver.py had successfully submitted the job to slurm, the job had waited in the queue for its turn, and slurm had begun launching it once its turn came, but the launch then failed with a message about a socket timeout. This failure was not restricted to any one worker node.

DM-14181 was filed for further investigation. All failures were resubmitted iteratively; many failed again in later iterations, but eventually all were pushed through. On the morning of April 24, a brief downtime of the verification cluster was used to increase the LDAP timeout in sssd.conf on the verify nodes. Afterwards the socket timeout problems were no longer seen.
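The fix amounted to raising an LDAP timeout in sssd.conf on the verify nodes. A minimal sketch of that kind of change, purely for illustration (the exact option name and value actually used are not recorded here):

```ini
# /etc/sssd/sssd.conf -- illustrative fragment; the option name and
# value below are assumptions, not the values applied on the verify nodes.
[domain/default]
# Raise the LDAP operation timeout so slow directory lookups do not
# surface as socket timeouts during job launch.
ldap_search_timeout = 60
```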

The execution of skyCorrection.py is independent per visit. The same visit grouping as singleFrameDriver was used, resulting in 157 slurm jobs.

A sqlite3 file was made for each layer to store which tracts/patches overlap which CCDs, checking each calexp using the skymap.findTractPatchList and geom.convexHull features. Unlike the S17B HSC PDR1 reprocessing, only tracts that are in PDR1 were included this time; the tract IDs can be found in https://hsc-release.mtk.nao.ac.jp/doc/index.php/database/ (except tract=9572) or in the first table on the S17B HSC PDR1 reprocessing page. A few additional tracts were processed initially but were manually cleaned up afterwards.
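The per-layer lookup can be sketched as a small sqlite3 table mapping each (tract, patch) to the visit/CCD pairs whose calexp footprint overlaps it. The schema below is an assumption for illustration only; the real script computed the overlaps with the LSST skymap.findTractPatchList and geom.convexHull APIs, which are stubbed out here with hard-coded rows.

```python
import sqlite3

# Illustrative schema; the layout of the real per-layer files is not recorded here.
conn = sqlite3.connect(":memory:")  # the campaign wrote one file per layer
conn.execute(
    "CREATE TABLE overlaps (tract INTEGER, patch TEXT, visit INTEGER, ccd INTEGER)"
)

def record_overlap(tract, patch, visit, ccd):
    # In the real script these rows would come from checking each calexp
    # against the skymap (findTractPatchList over the image's convex hull).
    conn.execute("INSERT INTO overlaps VALUES (?, ?, ?, ?)", (tract, patch, visit, ccd))

record_overlap(9813, "4,4", 1228, 49)
record_overlap(9813, "4,4", 1230, 50)
conn.commit()

# Downstream steps can then ask: which visit/ccd pairs feed this tract/patch?
rows = conn.execute(
    "SELECT visit, ccd FROM overlaps WHERE tract = ? AND patch = ?", (9813, "4,4")
).fetchall()
```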

mosaic.py and coaddDriver.py were run for each tract × filter combination, using all visits overlapping that tract in that filter for each layer: 69 jobs in UDEEP, 218 jobs in DEEP, and 455 jobs in WIDE. One node was used for each job.
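The per-layer job counts follow from enumerating the tract × filter combinations that have at least one overlapping visit. A sketch, with hypothetical placeholder tracts, filters, and overlap lookup (the real campaign derived these from the per-layer sqlite3 overlap files):

```python
from itertools import product

# Hypothetical inputs for illustration; not the actual PDR1 tract/filter sets.
tracts = [9812, 9813, 9814]
filters = ["HSC-G", "HSC-R", "HSC-I", "HSC-Z", "HSC-Y"]

def overlapping_visits(tract, filt):
    # Stand-in for the sqlite3 overlap lookup; pretend no HSC-Y visits
    # overlap any tract, so those combinations produce no job.
    return [] if filt == "HSC-Y" else [1228, 1230]

# One mosaic.py + coaddDriver.py slurm job per (tract, filter) combination
# that has at least one overlapping visit.
jobs = [(t, f) for t, f in product(tracts, filters) if overlapping_visits(t, f)]
```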

multiBandDriver.py was then run for each tract. There are 11 tracts in total in the UDEEP layer, so the starting plan was to run multiband in 11 jobs. In the first attempt, each job used 4 nodes and 12 cores per node. Some jobs failed to launch due to DM-14181, before the sssd timeout window was updated on the morning of April 24; those jobs were resubmitted. Some jobs failed because they ran out of memory: tract=8523 and tract=9813. I then attempted to run them with 5 nodes and 6 cores per node (without reusing the existing data). tract=8523 finished, but tract=9813 ran out of memory again. I continued tract=9813 using the --reuse-outputs-from option, and it then completed. In total, therefore, 12 slurm jobs contributed to the output data products in UDEEP.
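The out-of-memory retries trade cores for memory: slurm allocated whole nodes, so running fewer processes per node gives each process a larger share of the node's memory. A small worked example, assuming (purely for illustration) 128 GB of usable memory per node:

```python
NODE_MEM_GB = 128  # assumed for illustration; the actual verify-node memory is not stated here

def mem_per_process(cores_per_node, node_mem_gb=NODE_MEM_GB):
    """Memory available to each worker process when a node's memory is
    shared evenly among the cores used on that node."""
    return node_mem_gb / cores_per_node

first_attempt = mem_per_process(12)  # 4 nodes x 12 cores/node
retry = mem_per_process(6)           # 5 nodes x 6 cores/node
# Halving the cores per node doubles each process's memory share,
# regardless of the total node count.
assert retry == 2 * first_attempt
```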

In the DEEP layer, there are 37 tracts in total. In the first attempt, 37 slurm jobs were submitted, each using 4 nodes and 12 cores per node. All completed except the job for tract=9463, which ran out of memory and failed. tract=9463 was then re-run using 5 nodes and 6 cores per node (without reusing the existing data); it completed. In the WIDE layer, there are 91 tracts in total. It was completed in 91 slurm jobs, using either 3 or 4 nodes per job and 12 cores per node.

In many cases, Hsin-Fang Chiang set a larger time limit or allowed more memory than strictly necessary; this was not optimal, but it helped minimize manual intervention throughout the campaign, arguably improving the overall throughput.

The campaign mainly used the 15 worker nodes under the slurm reservation (IHS-749), but also used other workers outside the reservation when they were idle. Throughout the campaign, Hsin-Fang Chiang did not occupy the entire cluster, and at least some worker nodes in the normal queue remained available for, or were used by, other users. The intention was to let developers' small jobs (a few nodes) bypass the production jobs, and to keep the queue wait short for their large jobs. This balance was kept manually. It would be nice to have a production queue doing this for me (and also keeping a more consistent balance).