Slack channels:  #ops-rehearsal-3   #comcam-prompt-processing

About ops-rehearsal-3: Rehearsal script, Nightlogs, Nightly Monitoring Resources

APDB: `rubin@usdf-prompt-processing.slac.stanford.edu/lsst-devl` in schema pp_apdb_lsstcomcamsim. See Accessing the APDB in the USDF

Central butler repo /repo/embargo 

  • Daily chain: LSSTComCamSim/prompt/output-<day_obs>
  • Templates collection, as chained in LSSTComCamSim/templates, is u/homer/w_2024_12/DM-43439/20240323T142118Z
  • Calibration chain:

    LSSTComCamSim/calib                             CHAINED    
      LSSTComCamSim/calib/DM-43441                  CALIBRATION
      LSSTComCamSim/calib/DM-43441/unbounded        RUN        
      u/erykoff/DM-43224/abrought-bfk/bfk.20240308a CALIBRATION
      LSSTComCamSim/calib/DM-42287                  CALIBRATION
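
The daily chain naming above can be sketched as a small helper. This is only an illustration: the function name `daily_chain` and the constant `DAILY_CHAIN_PREFIX` are hypothetical, and the ISO date format for `<day_obs>` is inferred from the collection names listed further down this page.

```python
from datetime import date

# Prefix taken from the daily chain naming in this page.
DAILY_CHAIN_PREFIX = "LSSTComCamSim/prompt/output-"


def daily_chain(day_obs: date) -> str:
    """Return the daily output chain name for a given day_obs.

    Assumes <day_obs> is written as YYYY-MM-DD, as in the entries below.
    """
    return f"{DAILY_CHAIN_PREFIX}{day_obs.isoformat()}"


print(daily_chain(date(2024, 4, 4)))
```

The resulting string could then be used as the collection name when querying /repo/embargo.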

For LATISS, see Prompt Processing with AuxTel Imaging Survey Data 2024 

Each entry below gives, in order: day_obs of data collection; tag of prompt_processing or prompt-service; output collection in /repo/embargo (LSSTComCamSim/prompt/output-<day_obs>); notes; summary of pipeline outputs.

2024-04-04

2.5.0 (d_2024_04_04)

(with DM-43674 and DM-43590)

LSSTComCamSim/prompt/output-2024-04-04

Memory limit: 16GiB.

782 ops-rehearsal-3 nextVisit events

2 canceled

780 exposures

USDF storage contention from DRP batch processing saturated the network until ~16:57 PT. The PP request backlog cleared by 17:06 PT. 218 timeouts occurred before ~17:11 PT.

  • Header corruption issue (DM-43662): 41 fell back to ISR, 12 failed to do ISR (DM-43649).
  • 258 timed out waiting for images; 218 of them were from before 17:11 PT and 10 were from canceled exposures. Some are related to DM-39022.
  • 472 SIGKILL
  • 2 Postgres DeadlockDetected in diaPipe (DM-43783)
  • ~497 premature pod shutdown, including 8 for the canceled groups.
  • 152 connection refused and didn't make it to prompt service
  • 1 connection reset by peer and didn't make it to prompt service 
  • 1 timed out dialing and didn't make it to prompt service 


5647 outputs / 7020 raws

  • 41 ISR-only (DM-43662)
  • 5606 successful ProcessCcd and diaPipe attempts
    • 2 failed diaPipe due to the APDB deadlock (DM-43783)
    • 5604 ApPipe results.
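
As a quick sanity check, the 2024-04-04 counts above can be verified to add up. This is a minimal sketch using only the numbers in this entry; the value of 9 detectors per exposure is an assumption (the ComCam 3x3 CCD layout), chosen because it reproduces 7020 raws from 780 exposures.

```python
# Counts copied from the 2024-04-04 entry.
exposures = 780
detectors_per_exposure = 9  # assumption: ComCam's 3x3 CCD array
raws = 7020
outputs = 5647
isr_only = 41
pipeline_attempts = 5606
diapipe_failures = 2
appipe_results = 5604

# Internal consistency of the summary.
assert exposures * detectors_per_exposure == raws
assert isr_only + pipeline_attempts == outputs
assert pipeline_attempts - diapipe_failures == appipe_results

# Fraction of raws that produced any output.
output_fraction = outputs / raws
print(f"{output_fraction:.1%} of raws produced outputs")
```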
2024-04-03

2.4.0 (d_2024_03_29) 
(454de5c9)

LSSTComCamSim/prompt/output-2024-04-03

LSSTComCamSim/prompt/output-2024-04-03/ApPipe-noForced/prompt-proto-service-lsstcomcamsim-00024

LSSTComCamSim/prompt/output-2024-04-03/Isr/prompt-proto-service-lsstcomcamsim-00024


Increased pods' memory limit from 8 GiB to 16 GiB

787 ops-rehearsal-3 nextVisit events

5 canceled

780 exposures. Some raws were transferred the next morning, or their file notifications were lost due to the saturated network; 32 of those were in auto-ingest (6988/7020).


  • USDF storage contention from DRP batch processing: no raw images were written in the first hour of the night (until ~17:30 PT), causing many timeouts and a backlog of 390 when the storage issue was mitigated.
  • Raw data transfer issue: not all raws were transferred during the night.
  • Header corruption issue (DM-43662): 28 fell back to ISR, 18 failed to do ISR (DM-43649).
  • ~461 premature pod shutdowns
  • 664 timed out waiting for images; 441 of them were from before 17:45 PT. Some were from canceled exposures. Some are related to DM-39022.
  • 654 connection refused and didn't make it to prompt service
  • 2 timed out dialing and didn't make it to prompt service
  • 412 SIGKILL, starting ~18:50
  • 1 broker communication failure (DM-43590)
  • 1 Postgres DeadlockDetected in diaPipe (DM-43783)

4884 outputs / 7020 raws

  • 28 ISR-only (DM-43662)
  • 4856 successful ProcessCcd and diaPipe attempts
    • 2 of them failed diaPipe: 1 from the broker communication failure (DM-43590) and 1 from the APDB deadlock (DM-43783)
    • 4854 ApPipe results.
2024-04-02

2.4.0 (d_2024_03_29) 
(454de5c9)

LSSTComCamSim/prompt/output-2024-04-02

LSSTComCamSim/prompt/output-2024-04-02/ApPipe-noForced/prompt-proto-service-lsstcomcamsim-00023

LSSTComCamSim/prompt/output-2024-04-02/Isr/prompt-proto-service-lsstcomcamsim-00023

782 ops-rehearsal-3 nextVisit events
4 canceled

7002 raws; some were transferred the next morning (DM-43632)

 
  • Failures due to missing or corrupted headers (invalid character in the delivered header service data, DM-43662): 130 fell back to ISR, 52 failed to do ISR (DM-43649).
  • 11 broker communication failures (DM-43590)
    • 8 of them failed before any processing. 3 had partial outputs.
  • 338 timed out waiting for images. Some were caused by the image transfer issue (DM-43632). Some images possibly arrived before the pod was ready (DM-39022).

  • 269 connection refused and didn't make it to prompt service 
  • 26 connection reset by peer and didn't make it to prompt service 
  • 1 timed out dialing and didn't make it to prompt service 
  • ~592 premature pod shutdown
  • ~77 took too long in the pipeline and hit the 900s timeout (DM-43666)
  • 1089 SIGKILL: some from OOM, some from other causes



4583 outputs / 7002 raws

  • 74 ISR-only (DM-43662)
  • 4509 successful ProcessCcd and diaPipe attempts
    • 3 failed diaPipe with the broker communication failure (DM-43590)
    • 4506 ApPipe results.
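
To compare the three nights at a glance, the ApPipe yield (ApPipe results divided by raws) can be computed from the summaries on this page. This is a rough sketch, not an official metric: the helper name `appipe_yield` is hypothetical, and the denominators are the raw counts as quoted in each "outputs / raws" line.

```python
# (ApPipe results, raws) per day_obs, copied from the summaries on this page.
nights = {
    "2024-04-04": (5604, 7020),
    "2024-04-03": (4854, 7020),
    "2024-04-02": (4506, 7002),
}


def appipe_yield(results: int, raws: int) -> float:
    """Fraction of raws that ended with an ApPipe result."""
    return results / raws


yields = {day: appipe_yield(*counts) for day, counts in nights.items()}

# Print in chronological order.
for day in sorted(yields):
    print(f"{day}: {yields[day]:.1%}")
```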