Slack channels: #ops-rehearsal-3 #comcam-prompt-processing
About ops-rehearsal-3: Rehearsal script, Nightlogs, Nightly Monitoring Resources
APDB: `rubin@usdf-prompt-processing.slac.stanford.edu/lsst-devl` in schema `pp_apdb_lsstcomcamsim`. See Accessing the APDB in the USDF.
Central butler repo: `/repo/embargo`
- Daily chain: `LSSTComCamSim/prompt/output-<day_obs>`
- Templates collection, as chained in `LSSTComCamSim/templates`, is `u/homer/w_2024_12/DM-43439/20240323T142118Z`
Calibration chain:
LSSTComCamSim/calib CHAINED
LSSTComCamSim/calib/DM-43441 CALIBRATION
LSSTComCamSim/calib/DM-43441/unbounded RUN
u/erykoff/DM-43224/abrought-bfk/bfk.20240308a CALIBRATION
LSSTComCamSim/calib/DM-42287 CALIBRATION
For LATISS, see Prompt Processing with AuxTel Imaging Survey Data 2024
day_obs of data collection | Tag of prompt_processing or prompt-service | Output collection in /repo/embargo LSSTComCamSim/prompt/output-<day_obs> | Notes | Summary of pipeline outputs |
---|---|---|---|---|
2024-04-04 | 2.5.0 (d_2024_04_04) (with DM-43674 and DM-43590) | LSSTComCamSim/prompt/output-2024-04-04 | Memory limit: 16 GiB. 782 ops-rehearsal-3 nextVisit events, 2 canceled; 780 exposures. USDF storage contention from DRP batch processing saturated the network until ~16:57 PT. PP request backlog cleared by 17:06 PDT. 218 timeouts until ~17:11.
- Header corruption issue (DM-43662). 41 fell back to ISR, 12 failed to do ISR (DM-43649)
- 258 timed out waiting for images; 218 of them were from before 17:11 PT and 10 were from canceled exposures. Some are covered by DM-39022.
- 472 SIGKILL
- 2 Postgres DeadlockDetected in diaPipe (DM-43783)
- ~497 premature pod shutdown, including 8 for the canceled groups.
- 152 connection refused and didn't make it to prompt service
- 1 connection reset by peer and didn't make it to prompt service
- 1 timed out dialing and didn't make it to prompt service
| - 41 ISR-only (DM-43662)
- 5606 successful ProcessCcd and diaPipe attempts
- 2 failed diaPipe due to APDB deadlock (DM-43783)
- 5604 ApPipe results.
|
2024-04-03 | 2.4.0 (d_2024_03_29) (454de5c9) | LSSTComCamSim/prompt/output-2024-04-03 LSSTComCamSim/prompt/output-2024-04-03/ApPipe-noForced/prompt-proto-service-lsstcomcamsim-00024 LSSTComCamSim/prompt/output-2024-04-03/Isr/prompt-proto-service-lsstcomcamsim-00024
| Increased pods' memory limit from 8 GiB to 16 GiB. 787 ops-rehearsal-3 nextVisit events, 5 canceled; 780 exposures. Some raws were transferred the next morning, or their file notifications were lost due to saturated networks. 32 of those in auto-ingest. 6988/7020.
- USDF storage contention from DRP batch processing, and no raw images were written in the first hour of the night (until ~17:30 PT), causing many timeouts and a backlog of 390 when the storage issue was mitigated.
- Raw data transfer issue: not all raws were transferred during the night.
- Header corruption issue (DM-43662). 28 fell back to ISR, 18 failed to do ISR (DM-43649)
- ~461 premature pod shutdown
- 664 timed out waiting for images; 441 of the 664 were from before 17:45 PT and some were from canceled exposures. Some are covered by DM-39022.
- 654 connection refused and didn't make it to prompt service
- 2 timed out dialing and didn't make it to prompt service
- 412 SIGKILL, starting at ~18:50
- 1 broker communication failure (DM-43590)
- 1 Postgres DeadlockDetected in diaPipe (DM-43783)
| - 28 ISR-only (DM-43662)
- 4856 successful ProcessCcd and diaPipe attempts
- 2 of them failed diaPipe: 1 due to the broker communication failure (DM-43590), 1 due to APDB deadlock (DM-43783)
- 4854 ApPipe results.
|
2024-04-02 | 2.4.0 (d_2024_03_29) (454de5c9) | LSSTComCamSim/prompt/output-2024-04-02 LSSTComCamSim/prompt/output-2024-04-02/ApPipe-noForced/prompt-proto-service-lsstcomcamsim-00023 LSSTComCamSim/prompt/output-2024-04-02/Isr/prompt-proto-service-lsstcomcamsim-00023 | 782 ops-rehearsal-3 nextVisit events, 4 canceled. 7002 raws; some were transferred the next morning (DM-43632)
- Failures due to missing header, header corruption: invalid character in the delivered header service data (DM-43662). 130 fell back to ISR, 52 failed to do ISR (DM-43649)
- 11 broker communication failures (DM-43590). 8 of them failed before any processing; 3 had partial outputs.
- 338 timed out waiting for images. Some were due to the image transfer issue (DM-43632); some possibly arrived before the pod was ready (DM-39022).
- 269 connection refused and didn't make it to prompt service
- 26 connection reset by peer and didn't make it to prompt service
- 1 timed out dialing and didn't make it to prompt service
- ~592 premature pod shutdown
- ~77 took too long in the pipeline and hit the 900 s timeout (DM-43666)
- 1089 SIGKILL: some from OOM, some from other causes
| 4583 outputs / 7002 raws
- 74 ISR-only (DM-43662)
- 4509 successful ProcessCcd and diaPipe attempts
- 3 failed diaPipe with broker communication failure (DM-43590)
- 4506 ApPipe results.
|
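
The "Summary of pipeline outputs" column can be cross-checked with a little arithmetic. The sketch below is not part of the Prompt Processing tooling: the `NightSummary` class and its field names are hypothetical, and it assumes that the ApPipe result count should equal the successful ProcessCcd and diaPipe attempts minus the diaPipe failures, which is how the three rows above appear to be tallied. The numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NightSummary:
    """Per-night counts copied from the summary column (hypothetical container)."""

    day_obs: str
    isr_only: int             # "ISR-only" outputs
    successful_attempts: int  # "successful ProcessCcd and diaPipe attempts"
    failed_diapipe: int       # diaPipe failures among those attempts
    appipe_results: int       # reported "ApPipe results"

    def expected_appipe_results(self) -> int:
        # Assumption: every attempt that did not fail diaPipe yields one ApPipe result.
        return self.successful_attempts - self.failed_diapipe

    def is_consistent(self) -> bool:
        return self.expected_appipe_results() == self.appipe_results


NIGHTS = [
    NightSummary("2024-04-04", 41, 5606, 2, 5604),
    NightSummary("2024-04-03", 28, 4856, 2, 4854),
    NightSummary("2024-04-02", 74, 4509, 3, 4506),
]

for night in NIGHTS:
    status = "ok" if night.is_consistent() else "MISMATCH"
    print(f"{night.day_obs}: expected {night.expected_appipe_results()}, "
          f"reported {night.appipe_results} ({status})")

# The 2024-04-02 row also reports a total of 4583 outputs, which should be
# the ISR-only outputs plus the ProcessCcd/diaPipe attempts.
total_2024_04_02 = NIGHTS[2].isr_only + NIGHTS[2].successful_attempts
```

Under that assumption all three nights are internally consistent, and the 2024-04-02 total matches the reported 4583 outputs.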