Mondays 12PT (3 - 3:50pm ET)
Yusra's Zoom: https://princeton.zoom.us/my/yusra
Attendees:
Yusra AlSayyad, Lee Kelvin, Erfan Nourbakhsh, Fred Moolekamp, Clare Saunders, Joshua Meyers, Orion Eiger, Robert Lupton, Eli Rykoff, Colin Slater, Lauren MacArthur, Hsin-Fang Chiang, Jim Bosch, Nate Lust
Regrets:
Agenda:
- Meeting recorder - Clare! (last 6 meetings were: Lee, Colin, Jim, Keith, Fred, Nate)
- Announcements
- None
- Review Action items from last month
- Yusra AlSayyad and Orion confirmed that it was w_2022_48 that was used in the success mystery from last time.
- Dataset types - parquet table as an astropy table instead of a dataframe - this gets propagated everywhere.
- Failures from w_2023_03 - prod will be slightly changing with Jim Bosch's changes.
- Processing Status
- W07 - lots of problems but most do not have to do with pipelines
- issue with ip_diffim - fix needed on step4
- you can now run on a ticket branch (used DM-38209)
- Response from Yusra: While testing this branch, a memory-corruption problem was found on ip_diffim main (as of two weeks ago). It was probably introduced by DM-32406. Until this is fixed, we still need to run on a ticket branch.
- Not clear why DM-32406 would cause memory issues, maybe something upstream
- Eli: I apologize if this has to do with the pybind11 consolidation. Yusra thinks this was not the problem, because segfaults happened before Matthias's ticket. DM-32406 was the only new ticket merged to main since the previous successful run.
- everything else worked except logging errors
- done since last Tuesday, but dispatch has not been working since then.
- Review the w_2023_07/ DM-38042 rerun:
- We don't have metrics, but we do have plots.
- Recall that weekly 03 is the one where we didn't have the objects on the edges of the tracts.
- We will compare to w_2022_48, because that is the last good one since Jim's major pipeline changes.
- Lots more plots than in w48.
- Some stats are still missing on the two-histogram plots
- Some astrometry difference plots don't have the expected distribution (this is not new). Clare Saunders is going to look into this.
- Comparing resource usage between w03 and w07. There are some big differences that are probably tied to the w03 issues.
- step4 problems.
- finding plugin fixed on DM-38209 (but testing with main ip_diffim shows that there might be memory problems in w11: https://lsstc.slack.com/archives/C025SQLKV0X/p1678143478405449)
- Recall history:
- w12: psfex
- w16: piff (bad size residuals)
- w20: finalizeCharacterize (bad apcorr configs → bad stellar locus)
- w22: lanczos11 + apcorr configs (better stellar locus and size residuals approx equivalent to psfEx!)
- w24: PIFF kernelSize to 25. new scarlet lite storage.
- w28: attempt at fixing measure failures: DM-35722
- w32: First RC2 at SLAC. subtractImages compatibility mode on
- w36: subtractImages Compatibility mode off
- w40:
- w44:
- w48: 9697/7 succeeds, extra subtractImage failures gone
- w03:
- w11:
- Chronograf, plot-navigator review of _07
- w_2023_06 DC2:
- Orion - there were no errors at all. Jim says no jointcal (i.e. no tract based steps) means no problem.
- g band was lost in a previous rerun but is now back.
- stellar_locus_width_wPerp is way up
- Eli merged a change in how we compute aperture correction maps, but the stellar locus is mysterious
- Jim: Is the selection now including more things?
- Eli: There are not a huge number of stars that go into the calculation.
- On the nightly you can see some big jumps in a few metrics
- One change was in AM2 g-band - this lines up with gbdes being turned on
- Stellar locus - seems to be tied to aperture.
- The stellar locus would be changed by the extendedness, which is affected by the change in aperture corrections.
- We will now have more faint objects
- Second jump in stellar locus probably tied to gbdes
- Jim: Not panicking yet, but need to see what happens on the monthly rerun
- Didn't see this jump in the RC2 stellar locus plots - you can see some change on the plots, and you also see that the number of stars goes up – different selection effects.
- Jim and Robert: what can we do to mitigate the fact that our metrics are very sensitive to selection effects?
- What do we expect next time
- potential memory issues in ip_diffim after DM-32406
- I'm sorry DM-38209 didn't get in before w11. You'll need to run with a branch again
- FGCM now uses IsolatedStarAssociation instead of its own associator. This shouldn't cause any major changes, but there will be different random selection, and the tasks have changed.
- Orion is trying to get w11 running using cmtools, but it is not working yet. We don't know how to run with cmtools on a ticket branch.
- Yusra: longer than 4.5 weeks (runtime of w07) is too long. If it takes longer than two weeks because we are trying to figure out how to run, that is workable.
- Other notes:
- Yusra: remember that you should be looking at plots in the areas that you are responsible for!
- Robert: How far are we from just getting an alert from the metrics that we should look at the plots?
- AOB:
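On Robert's question about getting an alert from the metrics: a minimal sketch of one possible approach, comparing each metric between a baseline rerun and the current rerun and flagging large relative changes. The function name `flag_metric_jumps`, the 20% threshold, the `AM2_g` key, and all numbers are hypothetical and for illustration only; this is not an existing tool, and it does nothing about the selection-effect sensitivity raised by Jim and Robert.

```python
def flag_metric_jumps(baseline, current, rel_threshold=0.2):
    """Return {metric: relative change} for metrics that moved by more than rel_threshold.

    baseline and current are plain dicts of metric name -> value
    (hypothetical inputs; how the values are fetched is out of scope here).
    """
    flagged = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is None or old == 0:
            # Skip metrics missing from the current run, or with a zero
            # baseline where a relative change is undefined.
            continue
        rel_change = (new - old) / abs(old)
        if abs(rel_change) > rel_threshold:
            flagged[name] = rel_change
    return flagged


# Made-up values for illustration; not measurements from any rerun.
baseline = {"stellar_locus_width_wPerp": 10.0, "AM2_g": 5.0}
current = {"stellar_locus_width_wPerp": 14.0, "AM2_g": 5.2}

alerts = flag_metric_jumps(baseline, current)
for name, change in alerts.items():
    print(f"{name}: changed by {change:+.0%}, check the plots")
```

A fixed relative threshold is the simplest choice; a per-metric threshold or a comparison against the scatter of recent reruns would likely be needed in practice.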
Action Items
| Description | Due date | Assignee | Task appears on |
|---|---|---|---|
| | 04 Sep 2020 | Sophie Reed | DRP Metrics Monitoring 2020-08-07 |
| | | | DRP Metrics Monitoring 2023-06-26 |
| | | | DRP Metrics Monitoring 2023-06-26 |
| | | | DRP Metrics Monitoring 2022-10-31 |
| | | Yusra AlSayyad | DRP Metrics Monitoring 2021-06-14 |
| | | Arun Kannawadi | DRP Metrics Monitoring 2021-04-19 |
| | | Arun Kannawadi | DRP Metrics Monitoring 2021-03-01 |
| | | Yusra AlSayyad | DRP Metrics Monitoring 2021-01-04 |
| | | Jeffrey Carlin | DRP Metrics Monitoring 2021-01-04 |