The main objective is to produce the data products needed for the Rubin Observatory Algorithms Workshop.

The catch-all ticket is DM-23243. Output repos will be inside /datasets/hsc/repo/rerun/DM-23243/.


Input dataset: HSC PDR2 

1. What data products do we need for the Algorithms Workshop? 

Number of visits read from /datasets/hsc/repo/registry.sqlite3 (these are processed by singleFrameDriver); a query sketch follows the table.

Field | Total visits
SSP_UDEEP_SXDS | 233
SSP_UDEEP_COSMOS | 777
SSP_DEEP_XMM_LSS / SSP_DEEP_XMMS_LSS | 204
SSP_DEEP_ELAIS_N1 | 531
SSP_DEEP_DEEP2_3 | 417
SSP_DEEP_COSMOS | 680
SSP_WIDE | 14440
SSP_AEGIS | 34
SSP Total | 17316
(UH) COSMOS | 199 (ignore 7)
Total | 17151

SSP Total by filter: HSC-G 2863, HSC-I 1108, HSC-I2 2097, HSC-R 1560, HSC-R2 1497, HSC-Y 3787, HSC-Z 3972, NB0387 74, NB0816 154, NB0921 204.
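
As a cross-check, the counts above can be regenerated directly from the Gen2 registry. A minimal sketch, assuming the registry's raw table carries field, filter, and visit columns (the HSC Gen2 mapper schema); verify the column names against the actual registry before relying on it.

    # Count distinct visits per (field, filter) in the Gen2 registry.
    import sqlite3

    REGISTRY = "/datasets/hsc/repo/registry.sqlite3"

    with sqlite3.connect(REGISTRY) as conn:
        rows = conn.execute(
            "SELECT field, filter, COUNT(DISTINCT visit) "
            "FROM raw GROUP BY field, filter ORDER BY field, filter"
        ).fetchall()

    for field, filt, nvisits in rows:
        print(f"{field:24s} {filt:8s} {nvisits:6d}")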


Number of visits used in coaddition: 2792 visits for DEEP+UDEEP and 11821 visits for WIDE (only visits from NAOJ's tract-visit list are used).

Tract list copied from the HSC release page, the table of "database records":

UDEEP+DEEP | Filters | Tracts
COSMOS | g,r,i,z,y,NB0387,NB0816,NB0921 | 9569-9572, 9812-9814, 10054-10056
DEEP2-3 | g,r,i,z,y,NB0387,NB0816,NB0921 | 9219-9221, 9462-9465, 9706-9708
ELAIS-N1 | g,r,i,z,y,NB0816,NB0921 | 16984-16985, 17129-17131, 17270-17272, 17406-17407
SXDS+XMM-LSS | g,r,i,z,y,NB0387,NB0816,NB0921 | 8282-8284, 8523-8525, 8765-8767

WIDE | Filters | Tracts
W01 (WIDE01H) | g,r,i,z,y | 8994-8999, 9236-9242, 9479-9485, 9722-9728, 9964-9969
W02 (XMM) | g,r,i,z,y | 8278-8286, 8519-8527, 8761-8769, 9003-9011, 9245-9253, 9488-9496, 9731-9739, 9973-9981, 10215-10223
W03 (GAMA09H) | g,r,i,z,y | 9069-9092, 9312-9335, 9555-9578, 9797-9820, 10039-10051, 10053-10057, 10282-10293, 10296-10298
W04 (WIDE12H+GAMA15H) | g,r,i,z,y | 9096-9136, 9338-9379, 9581-9622, 9824-9864, 10079-10084, 10101-10106, 10321-10326, 10343-10348
W05 (VVDS) | g,r,i,z,y | 8984-8986, 9206-9233, 9448-9476, 9691-9719, 9933-9960, 10175-10195, 10417-10436, 10659-10677, 10899-10904, 10912-10917
W06 (HECTOMAP) | g,r,i,z,y | 15808-15834, 15987-16012, 16162-16186
W07 (AEGIS) | g,r,i,z,y | 16821-16822, 16972-16973

The tract IDs for which we have data products in the WIDE layer: tract_id_wide.txt
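
For building data ID lists from the tract ranges above, a small helper can expand each "A-B" range into explicit tract IDs. This is only a convenience sketch; the function name is illustrative and not part of the pipeline commands.

    # Expand tract ranges such as "9569-9572, 9812-9814, 10054-10056"
    # (copied from the table above) into an explicit list of tract IDs.
    def expand_tracts(spec):
        tracts = []
        for chunk in spec.split(","):
            chunk = chunk.strip()
            if "-" in chunk:
                lo, hi = (int(x) for x in chunk.split("-"))
                tracts.extend(range(lo, hi + 1))
            elif chunk:
                tracts.append(int(chunk))
        return tracts

    print(expand_tracts("9569-9572, 9812-9814, 10054-10056"))
    # [9569, 9570, 9571, 9572, 9812, 9813, 9814, 10054, 10055, 10056]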


2. Stack versions, pipeline steps and configs:

To get this running as soon as possible, we are comfortable using different stack versions for different steps this time.

The following steps use the /software/lsstsw/stack_20191101 shared stack:

  • singleFrameDriver.py w_2020_05 default configs
  • skymap w_2020_05 default configs
  • jointcal w_2020_05 default configs
  • fgcm  w_2020_06 for buildStars, w_2020_06 + DM-23526 ticket branch for fit and outputProducts. 
  • skyCorrection w_2020_05 default configs
  • coadd w_2020_07   Use FGCM photometry:

    config.makeCoaddTempExp.externalPhotoCalibName='fgcm'
    config.assembleCoadd.externalPhotoCalibName='fgcm'
    config.assembleCoadd.assembleStaticSkyModel.externalPhotoCalibName='fgcm'

The following steps use the new shared stack at /software/lsstsw/stack_20200220:

  • multiband w_2020_08
  • validate_drp   matchedVisitMetrics.py  w_2020_08
  • validate_drp   validateDrp.py   TBD
  • pipe_analysis   w_2020_08 stack with qa_explorer at commit ab69304  and pipe_analysis at commit 09a7675.  Use fgcm PhotoCalib. 
  • post-processing  w_2020_08
  • forcedPhotCcd w_2020_08

Pipeline commands:  https://github.com/lsst-dm/s20-hsc-pdr2-reprocessing

Discussions:


  • Can we start with w_2020_03? The sfm difference between w_2020_03 and w_2020_05 is that the defects map is larger in w_2020_05.
  • w_2020_05 is not verified with RC2 yet, but it is targeted for starting sfm.
  • We want jointcal for astrometry and fgcm for photometry.

    • jointcal on UDEEP takes days. Each filter can run on a separate node; roughly 3 nodes for 5 days for the deepest tract. Give it 14+ days of walltime for UDEEP.

    • To parallelize better, photometry and astrometry can be run separately, e.g. run one with doPhotometry=False and the other with doAstrometry=False (see the config sketch after this list).
  • There are long (>60s) and short (30s) exposures. All were processed in sfm. Only long exposures should go into the coadds. All will be used in FGCM; this is debatable for astrometry. jointcal was already being run with only the long exposures when this was discussed (2020-02-11). The team decided not to redo the jointcal astrometry. Maybe in a new rerun we will include all exposures for jointcal and learn from that.
  • Do we want to run validate_drp on all tracts?
    • validate_drp on master today does not need coadd & multiband; it only needs sfm & jointcal outputs.
    • Jeff's ticket branch adds 4 new metrics. It uses r-band as a reference, so all filters depend on r-band data; there are no other new data dependencies. We may only want the new metrics in a few patches (TBD).
    • If we use the new metrics ticket branch, we need to understand the new data flow of validateDrp.py.
    • Only validateDrp.py needs the DM-22310 ticket; matchedVisitMetrics.py can start with a weekly release.
  • We want pipe_analysis too, though at lower priority than coadd. DM-21052 needs to be merged. visitAnalysis and compareVisitAnalysis are the two lowest priorities.
  • For the QA dashboard test, expedite the XMM-LSS field for visitAnalysis, coaddAnalysis, matchVisits.py, and post-processing.
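
A minimal sketch of the jointcal photometry/astrometry split mentioned above, written as config override files in the same style as the coadd overrides in section 2. doPhotometry and doAstrometry are existing jointcal config fields; the file names are illustrative, and how the overrides are passed to the runs should follow the pipeline-commands repo.

    # jointcalAstrometryOnly.py (illustrative file name): fit astrometry only,
    # so this run can occupy its own nodes.
    config.doPhotometry = False
    config.doAstrometry = True

    # jointcalPhotometryOnly.py (illustrative file name): fit photometry only,
    # submitted separately.
    config.doAstrometry = False
    config.doPhotometry = True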

3. Infrastructure: compute & disk space – Michelle B is aware and has it under control. 

  • The 2018 HSC PDR1 reprocessing (DM-13666) used 9227.15 node-hours; its output repo is ~123 TB.
  • PDR2 is ~3 times bigger in raw inputs.
  • Michelle can get 20 more nodes.
  • Hsin-Fang's idea is to have a reservation to create a new queue: IHS-3422.
  • A scheduled maintenance happened on Feb 27 and the lsst-dev* machines were rebooted; jobs on the worker nodes were not interrupted. Starting Feb 28, a rolling reboot of the worker nodes is being done (DM-23690).

4. Human resources from NCSA? 

  • Michelle is very happy to have Hsin-Fang coordinate and check that everything is running correctly and without errors, but would like to keep Monika involved in doing the running to continue building up experience.
  • Michelle wants to try to include Felipe.


5. Waiting for: 

  • Paul's new calibration set. Paul is copying everything into /scratch/pprice/CALIB-20200115. There may be missing data? The calib repo will be at /datasets/hsc/calib/20200115/.
  • Sky correction is waiting for the sky frame calibrations.
  • NAOJ's tract-visit mapping list: Yusra will follow up. Tract-visit mapping: https://www.dropbox.com/s/f1kv05k5vqv42pv/visitsFormatted_s19a_20200131.lis?dl=0
  • About the above visit list: we don't have data with visit ID > 138618. Do we simply ignore those new visits? Yes.
  • The above visit list also includes some UH COSMOS data (not SSP). We want to include them too.
  • Need to replace transmission curves per RFC-656 before sfm.
  • We want DM-23331, RFC-668, and DM-23434 for fgcmcal.


6. Job status and summary


Step | DEEP & UDEEP | WIDE | Total node-hours
singleFrameDriver | | | 2758.08
skymap | slurm job ID: 229995 | slurm job ID: 229996 | 0.02
jointcal | | | 3466.34
fgcmcal | | | 83.45
skyCorrection | | | 369.50
coadd | | | 3735.56
multiband | | |
post-processing | | |
forcedPhotCcd (low priority) | | |
matchedVisitMetrics (validate_drp) | | |
the new validateDrp.py? | | |
visitAnalysis | | |
compareVisitAnalysis (low priority) | | |
colorAnalysis | | |
coaddAnalysis | | |
matchVisits (qa_explorer) | | |
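
The node-hour numbers above are walltime multiplied by node count. A minimal sketch of that bookkeeping; the elapsed-time string follows Slurm's [D-]HH:MM:SS convention (e.g. sacct's Elapsed field), and the example values are illustrative rather than taken from the actual jobs.

    # Convert a job's elapsed walltime and node count into node-hours.
    def node_hours(elapsed, nnodes):
        """elapsed is a Slurm-style [D-]HH:MM:SS string."""
        days = 0
        if "-" in elapsed:
            d, elapsed = elapsed.split("-")
            days = int(d)
        h, m, s = (int(x) for x in elapsed.split(":"))
        return nnodes * (days * 24 + h + m / 60 + s / 3600)

    # e.g. a single-node job that ran for 1 minute 12 seconds:
    print(round(node_hours("00:01:12", 1), 2))   # 0.02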


7. Reproducible Pipelines Failures - singleFrameDriver

DEEP+UDEEP: 

301 CCDs failed in UDEEP; their data IDs are in fatals_id_udeep.txt. 1730 CCDs failed in DEEP; their data IDs are in fatals_id_deep.txt.
Among these 2031 reproducible failures:

  • 297 : No matches to use for photocal
  • 221 : RuntimeError: Unable to measure aperture correction
  • 28 : RuntimeError: Unable to match sources
  • 67 : No objects passed our cuts for consideration as psf stars
  • 1415 : InvalidParameterError: 'Only spatial variation (ndim == 2) is supported; saw 0'
  • 2 : TaskError: Fit failed: median scatter on sky = [] arcsec > 10.000 config.maxScatterArcsec
  • 1 : TypeError: 'The metadata does not describe an AST object'

WIDE: 

1390 CCDs failed in WIDE; their data IDs are in fatals_id_wide.txt.

  •  260 : InvalidParameterError: 'Only spatial variation (ndim == 2) is supported; saw 0'
  •  1 : RuntimeError: No good PSF candidates to pass to PSFEx
  •  839 : RuntimeError: No matches to use for photocal
  •  16 : RuntimeError: No objects passed our cuts for consideration as psf stars
  •  16 : RuntimeError: Unable to match sources
  •  4 : RuntimeError: Unable to measure aperture correction for required algorithm 'base_GaussianFlux': only 0 sources, but require at least 2.
  •  22 : RuntimeError: Unable to measure aperture correction for required algorithm 'base_GaussianFlux': only 1 sources, but require at least 2.
  •  10 : RuntimeError: Unable to measure aperture correction for required algorithm 'base_PsfFlux': only 0 sources, but require at least 2.
  •  35 : RuntimeError: Unable to measure aperture correction for required algorithm 'base_PsfFlux': only 1 sources, but require at least 2.
  •  10 : RuntimeError: Unable to measure aperture correction for required algorithm 'ext_photometryKron_KronFlux': only 0 sources, but require at least 2.
  •  22 : RuntimeError: Unable to measure aperture correction for required algorithm 'ext_photometryKron_KronFlux': only 1 sources, but require at least 2.
  •  6 : RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel_dev': only 0 sources, but require at least 2.
  • 26 : RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel_dev': only 1 sources, but require at least 2.
  • 6 : RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel_exp': only 0 sources, but require at least 2.
  • 26 : RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel_exp': only 1 sources, but require at least 2.
  • 10 : RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel_initial': only 0 sources, but require at least 2.
  • 37 : RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel_initial': only 1 sources, but require at least 2.
  • 7 : RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel': only 0 sources, but require at least 2.
  • 35 : RuntimeError: Unable to measure aperture correction for required algorithm 'modelfit_CModel': only 1 sources, but require at least 2.
  • 2 : ValueError: cannot convert float NaN to integer
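
Breakdowns like the two lists above can be regenerated by tallying the FATAL messages in the singleFrameDriver logs. A rough sketch, assuming the logs are plain text files; the glob pattern is a placeholder, not the actual log location.

    # Tally FATAL error messages across sfm log files.
    import glob
    from collections import Counter

    counts = Counter()
    for path in glob.glob("sfm_logs/wide/*.log"):   # placeholder log location
        with open(path, errors="replace") as f:
            for line in f:
                if "FATAL" in line:
                    counts[line.split("FATAL", 1)[1].strip()] += 1

    for msg, n in counts.most_common():
        print(f"{n:5d} : {msg}")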


8. Reproducible Pipelines Errors - Jointcal 

We are seeing some "ERROR: Potentially bad fit: High chi-squared/ndof." messages. Data IDs are attached in DM-23323 and DM-23395.

(Maybe only in tracts with few visits?)


9. Reproducible Pipelines Failures - skyCorrection 

Visits 137268 and 137288 failed with the error "No good pixels in image array"; only 1 and 2 calexps exist for these visits, respectively. DM-23551 is filed.

Both visits are 30s exposures in NB0387 from 2018-01-14; for continuing the reprocessing campaign, they are not needed in the coadd. 


10. FGCM 

fgcm_photoCalib products were not written for some visits. See DM-23394 and DM-23698.

In total, 138 visits are missing some fgcm_photoCalib products; some visits are missing fgcm_photoCalib for all CCDs and others only for selected CCDs.

The data IDs missing fgcm_photoCalib are

(DEEP+UDEEP) https://jira.lsstcorp.org/secure/attachment/42853/42853_fgcmNoPhoto_deep.txt 

(WIDE) https://jira.lsstcorp.org/secure/attachment/42854/42854_fgcmNoPhoto_wide.txt

A missing fgcm_photoCalib means there are no downstream data products for those visits/CCDs.
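
A sketch of how the missing products can be confirmed with the Gen2 Butler; the rerun path and visit IDs below are placeholders, not values from the attachments above.

    # Find (visit, ccd) pairs that lack an fgcm_photoCalib product.
    from lsst.daf.persistence import Butler

    butler = Butler("/datasets/hsc/repo/rerun/DM-23243/FGCM")   # placeholder rerun path
    visits = [123456, 123458]                                   # placeholder visit IDs

    missing = []
    for visit in visits:
        for ccd in range(104):                                  # HSC science CCDs are 0-103
            if not butler.datasetExists("fgcm_photoCalib", visit=visit, ccd=ccd):
                missing.append((visit, ccd))

    print(len(missing), "visit/ccd pairs missing fgcm_photoCalib")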


11. Reproducible Pipelines Errors -  coadd

Among many warnings, some logs also reported errors:

  • "All pixels masked. Cannot estimate background"
  • "No PsfMatched warps were found to build the template coadd ..."  This happens when the warp is made but the psfMatchedWarp isn't.

See DM-23602.


12. Reproducible Pipelines Failures - matchedVisitMetrics (validate_drp) 

If a tract+filter combination has only one visit, the task cannot work (DM-23581), so we do not run those cases.
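
A minimal sketch of that filtering, assuming a (tract, filter) -> visit-list mapping has already been built (for example from the NAOJ tract-visit list); the entries shown are placeholders.

    # Skip tract+filter combinations with only one visit (DM-23581)
    # before launching matchedVisitMetrics.
    visits_by_tract_filter = {
        (9570, "HSC-I"): [1234, 1236, 1238],   # placeholder entries
        (9571, "NB0921"): [2468],              # single visit: skipped
    }

    runnable = {
        key: visits
        for key, visits in visits_by_tract_filter.items()
        if len(visits) > 1
    }
    print(sorted(runnable))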


For WIDE, 15 failed with "FATAL: Failed: `ydata` must not be empty".

For DEEP, 

  • 3 jobs failed with "cannot do a non-empty take from an empty axes" (DM-23981).
  • 7 jobs failed with OOM on the cluster workers and seem to require >192 GB of memory. We decided to include only a subset of the visits for those.

See DM-23654.

Also note that the outputs are not proper Butler rerun repos; the task does not write its outputs using the Butler.


13. Reproducible Pipelines Errors - coaddAnalysis (pipe_analysis) 

  • "UnboundLocalError: local variable 'axes2' referenced before assignment" DM-23829
  • "RuntimeError: No good data points to plot for sample labelled: star"   DM-23894


14. Reproducible Pipelines Failures - others 










