Update in progress
See S15 Multi-Band Coadd Processing Prototype for a proposed change to some of the logic described herein.
This page attempts to capture at a high level the software and algorithm development necessary to implement the processing of objects detected at the full survey depth (at the time of a particular data release), including the detection, deblending, and measurement of sources too faint to be detected in any individual visit. The algorithms to be used here are generally poorly understood; we have many options for extending well-understood algorithms for processing single-epoch data to multi-epoch data, and considerable research is needed to find the right balance between computational and scientific performance in doing so. Unfortunately, different algorithmic options may require vastly different parallelization and data flow, so we cannot yet make assertions about even the high-level interfaces and structure of the code. We do, however, have a good understanding of most of the needed low-level algorithms, so our goal should be to implement these as reusable components that will allow us to quickly explore different algorithmic options. This will also require early access to parallelization interfaces, test data, and analysis tools that will be developed outside the DRP algorithms team.
* We don't need Image Differencing outputs to start the Deep Processing (e.g. we can probably do Image Coaddition first), and there may be some value to doing the DRP Image Differencing at the same time as some parts of the Deep Processing (Deep Background Modeling, in particular).
The next few sections contain the various components of the deep processing, in rough order - the exact flow is very much TBD. In fact, the lines drawn between several components are also somewhat arbitrary; the distinctions between detection, peak association, deblending, and measurement are based on a baseline algorithm that is derived largely from the SDSS processing, which was considerably simplified by only needing to process a single epoch for every band. Algorithms significantly different from this baseline would define these steps and the interfaces between them differently. Assuming that baseline, or something close to it:
We'll almost certainly need some sort of coadded image to detect faint sources, and do at least preliminary deblending and measurement. We'll use at least most of the same code to generate templates for Image Differencing.
Because there's no single coadd that best represents all the data it summarizes, there are several different kinds of coadds we may want to produce. It's best to think of coaddition as a major piece of the algorithms below, but one which will be executed up front and reused by different stages of the processing; while there will be algorithmic research involved in determining which kinds of coadds we build, that research effort will be covered under the detection, deblending, and measurement sections, below.
All coadds will be produced by roughly the following steps:
The overall processing flow for all coadds will likely be the same (though the final processing flow may or may not match our current prototype), and will share most of the same code.
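To make the shared stacking step concrete, here is a minimal, self-contained sketch of a mask-aware, inverse-variance-weighted stack. It assumes the inputs have already been warped onto the patch grid (and, for PSF-matched coadds, convolved to a common PSF); the function and variable names are illustrative, not the pipeline's actual API.

    import numpy as np

    def stack_images(images, variances, masks):
        """Inverse-variance-weighted, mask-aware mean of warped inputs.

        images, variances: lists of same-shape 2-D float arrays;
        masks: boolean arrays with True marking pixels to reject.
        """
        num = np.zeros_like(images[0], dtype=float)
        wsum = np.zeros_like(images[0], dtype=float)
        for img, var, msk in zip(images, variances, masks):
            w = np.where(msk, 0.0, 1.0 / var)  # rejected pixels get zero weight
            num += w * img
            wsum += w
        good = wsum > 0
        coadd = np.where(good, num / np.where(good, wsum, 1.0), np.nan)
        coadd_var = np.where(good, 1.0 / np.where(good, wsum, 1.0), np.inf)
        return coadd, coadd_var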
For any of these coadd types, we may also want to coadd across bands, either to produce a χ² coadd or a weighting that matches a given source spectrum. It should be possible to create any multi-band coadd by adding single-band coadds; we do not anticipate having to go back to individual exposures to create multi-band coadds.
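As a sketch of what building such multi-band coadds from single-band coadds might look like (the function names and the assumption of per-pixel noise sigmas are ours):

    import numpy as np

    def chisq_coadd(band_images, band_sigmas):
        """Per-pixel sum of squared S/N over bands (a chi-squared image),
        built entirely from single-band coadds, as noted above."""
        out = np.zeros_like(band_images[0], dtype=float)
        for img, sigma in zip(band_images, band_sigmas):
            out += (img / sigma) ** 2
        return out

    def sed_matched_coadd(band_images, band_sigmas, sed_fluxes):
        """Linear combination matched to an assumed per-band flux vector f_b,
        weighting each band by f_b / sigma_b**2 (up to normalization)."""
        out = np.zeros_like(band_images[0], dtype=float)
        for img, sigma, f in zip(band_images, band_sigmas, sed_fluxes):
            out += (f / sigma ** 2) * img
        return out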
We will also need different coadds for different sets of input images. As discussed above, including images with broad PSFs in direct and PSF-matched coadds degrades the image quality of the coadd even as it adds depth, so we will likely want to produce coadds of these types that are optimized for image quality in addition to those that are optimized for depth. We will also want to build coadds that correspond only to certain ranges in time, in order to detect (and possibly deblend and measure) moving sources with intermediate speeds (i.e. offsets that are a small but nonnegligible fraction of the PSF size between epochs) that are below the single-frame detection limit.
We have a relatively mature pipeline for direct coadds that should only require minor modification to produce at least crude PSF-matched coadds. We need to finish (fix/reimplement) the PSF-matched coadds, and do some testing to verify that they're working properly. We haven't put any effort at all into likelihood coadds yet, and we need to completely reimplement our approach to χ² and other multi-band coadds.
Our propagation of mask pixels to coadds needs to be reviewed and probably reconsidered in some respects.
Our stacker class is clumsy and inflexible, and needs to be rewritten. This is a prerequisite for really cleaning up the handling of masks.
We do not yet have a settled policy on how to handle transient, variable, or moving objects in coadds. Downstream processing would probably be simplified if transient and moving objects that can be detected at the single-epoch limit using Image Differencing were excluded entirely from the coadds, but some types of coadds do not handle rejecting masked pixels gracefully (at least in theory; this level of masking may work acceptably in practice), and we may want to adopt some other policy if this proves difficult. Similarly, it would probably be convenient if coadds were defined such that variable sources have their mean flux on the coadd, but this may not be worthwhile if the implementation proves difficult.
The coadd datasets are a bit of a mess, and need a redesign.
The data flow and parallelization are currently split into two separate tasks to allow for different parallelization axes, and the stacking step is parallelized at a coarser level than would be desirable for rapid development (though it may be acceptable for production). Background matching may also play a role in setting the data flow and parallelization for coaddition, as it shares some of the processing.
Choices about which images to include in a coadd are still largely human-directed, and need to be fully automated.
We have a flexible system for specifying coordinate systems, but we have not yet done a detailed exploration of which projection(s) we should use in production, or determined the best scales for the "tract" and "patch" levels. We have no plans yet for how to deal with tract-level overlaps, though it is likely this will occur at the catalog level, rather than in the images.
We have put no effort so far into analyzing and mitigating the effects of correlated noise, which will become more important when we regularly deal with more than just direct coadds (but may be important even for these). A major question here is how well we can propagate the noise covariance matrix; exactly propagating all uncertainty information is not computationally feasible, but there are several proposed approximation methods that are likely to be sufficient.
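As a one-dimensional illustration of why this matters: convolving white noise with any resampling or PSF-matching kernel correlates neighboring pixels, and the induced autocorrelation follows directly from the kernel. A small sketch (the kernel values below are made up for illustration):

    import numpy as np

    def noise_correlation(kernel):
        """Autocorrelation of white noise after convolution with `kernel`:
        Corr(d) = sum_i k[i] * k[i+d] / sum_i k[i]**2 (1-D case)."""
        full = np.correlate(kernel, kernel, mode="full")
        return full / full[len(kernel) - 1]  # normalize by the zero-lag value

    kernel = np.array([-0.05, 0.25, 0.6, 0.25, -0.05])  # toy interpolant
    print(noise_correlation(kernel))  # nonzero off-center terms = correlated noise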
The algorithms here are relatively well-understood, at least formally, and most of the research work is either in coming up with the appropriate shortcuts to take (e.g. in propagating noise covariance, handling masks) or is covered below in detection/deblending/measurement. Almost all of the coding we need to do for coaddition is pure Python, with the only real exception being the stacker (unless we need to make drastic changes to the convolution or warping code to improve computational performance or deal with noise covariances).
Scheduling
PSF-matched coadds are a requirement for any serious look into options for estimating colors. The measurement codes themselves need some additional testing before this becomes a blocker, however.
Likelihood coadds should be implemented as needed when researching deep detection algorithms.
Dataset and task cleanup should be done as soon as the new butler is available.
The stacker rewrite and mask propagation should be fixed up before we spend too much time investigating measurement algorithm behavior on coadds, as it will make our lives much easier there if we start with sane pixel masks.
Noise covariance work may need to come up early in detection algorithm research, but if not, it's a relatively low priority, as it may not end up being important unless we decide we need PSF-matched coadds for colors, and we've determined that the measurement algorithms we want to use there are affected by correlations in the noise.
Traditional background modeling involves estimating and subtracting the background from individual exposures separately. While this will still be necessary for visit-level processing, for deep processing we can use a better approach. We start by PSF-matching and warping all but one of the N input exposures (on a patch of sky) to match the final, reference exposure, and subtract these exposures, using much of the same code we use for Image Differencing. We then model the background of the N-1 difference images, where, depending on the quality of the PSF-matching, we can fit the instrumental background without interference from astrophysical backgrounds. We can then combine all N original exposures and subtract the N-1 background difference models, producing a coadd that contains the full-depth astrophysical signal from all exposures but an instrumental background for just the reference exposure. We can then model and subtract that final background using traditional methods, while taking advantage of the higher signal-to-noise ratio of the sources in the coadd. We can then also compute an improved background model for any of the individual exposures as the combination of its difference background relative to the reference and the background model for the reference.
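A minimal sketch of this flow, assuming the inputs have already been PSF-matched and warped to a common frame, and that a fit_background callable fits a smooth model to an image and returns it on the same grid; all names here are illustrative, not the pipeline's API:

    import numpy as np

    def background_matched_coadd(exposures, ref_index, fit_background):
        n = len(exposures)
        ref = exposures[ref_index]
        # The astrophysical signal cancels in each difference, so these fits
        # see only the difference of instrumental backgrounds, B_i - B_ref.
        diff_bgs = {i: fit_background(exp - ref)
                    for i, exp in enumerate(exposures) if i != ref_index}
        # Mean of the originals minus the mean of the difference backgrounds
        # leaves the full-depth signal plus only the reference background.
        coadd = np.mean(exposures, axis=0) - sum(diff_bgs.values()) / n
        ref_bg = fit_background(coadd)  # fit at coadd S/N, then remove it
        coadd -= ref_bg
        # Improved per-exposure background models, as described above.
        exposure_bgs = {i: bg + ref_bg for i, bg in diff_bgs.items()}
        exposure_bgs[ref_index] = ref_bg
        return coadd, exposure_bgs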
We have prototype code that works well for SDSS data, but experiments on the HSC side have shown that processing non-drift-scan data is considerably more difficult. One major challenge is selecting/creating a seamless reference exposure across amplifier, sensor, and visit boundaries, especially in the presence of gain and linearity variations. We also need to think about how the flat-fielding and photometric calibration algorithms interact with background matching, as the fact that the sky has a different color than the sources makes it impossible for a single photometric solution to simultaneously generate both a seamless sky and correct photometry - and we also need to be able to generate the final photometric calibration using measurements that make use of only the cruder visit-level background models.
The problem of generating a seamless reference image across the whole sky is very similar to the problem of building a template image for Image Differencing, and in fact the Image Differencing template or some other previously-built coadd may be a better choice for the reference image, once those coadds become available (this would require a small modification to the algorithm summarized above). Of course, in this case, the problem of bootstrapping those coadds would still remain.
We also don't yet have a good sense of where background matching belongs in the overall processing flow. It seems to share many intermediates with either Image Coaddition or Image Differencing, depending on whether the background-difference fitting is done in the coadd coordinate system (most likely; shared work with coaddition) or the original exposure frame (less likely; shared work with Image Differencing). It is also unclear what spatial scale the modeling needs to be done at, which could affect how we would want to parallelize it.
This is a difficult algorithmic research problem that interacts in subtle ways with ISR, Photometric Self-Calibration, and the data flow and parallelization for Image Coaddition. It should not be a computational bottleneck on its own, but it will likely need to piggyback on some other processing (e.g. Image Coaddition) to achieve this.
Because we can use traditional background modeling outputs as a placeholder, and the improvement due to background matching is likely to only matter when we're trying to really push the precision of the overall system, we can probably defer the complete implementation of background matching somewhat. It may be a long research project, though, so we shouldn't delay too long. We should have an earlier in-depth design period to sketch out possible algorithmic options (and hopefully reject a few) and figure out how it will fit into the overall processing.
Detection for isolated static sources with known morphology and SED is straightforward:
We may need to take some care with noise covariances introduced in warping, and the details of finding peaks and growing Footprints are more heuristic than statistically rigorous. Nevertheless, the real challenge here is expanding this procedure to a continuous range of morphologies and SEDs. There is a spectrum of options here, with the two extremes being:
We'll almost certainly use some mixture of these approaches; we'll certainly have more than one detection image, but it's not clear how many we'll have.
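Whatever mixture we choose, each detection image would be run through something like the standard matched-filter procedure. A minimal sketch against plain numpy/scipy arrays (the real code also handles masks, correlated noise, and out-of-bounds pixels):

    import numpy as np
    from scipy import ndimage

    def detect(image, variance, psf, nsigma=5.0):
        # Correlate with the known PSF to form the matched-filter image, and
        # propagate per-pixel variance assuming independent pixel noise.
        filtered = ndimage.correlate(image, psf, mode="constant")
        filtered_var = ndimage.correlate(variance, psf ** 2, mode="constant")
        significance = filtered / np.sqrt(filtered_var)
        # Threshold the significance map, label connected regions
        # (Footprints), and take each region's maximum as its peak.
        labels, nfoot = ndimage.label(significance > nsigma)
        peaks = ndimage.maximum_position(significance, labels,
                                         index=range(1, nfoot + 1))
        return labels, peaks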
While most transients and fast-moving objects will be detected via traditional Image Differencing, we'll also want to build coadds that cover only a certain range in time to detect faint long-term transients and faint moving objects. We can use the same approaches mentioned above on these coadds.
Most of the low-level code to run detection already exists, though it may need some minor improvements (e.g. handling noise correlations, improving initial peak centroids, dealing with bad and out-of-bounds pixels). A notable exception is our lack of support for building likelihood coadds.
We have a large algorithmic landscape to explore for how to put the high-level pieces together, as discussed in the Algorithm section. This is essentially a lot of scripting and experimenting, with attention to the computational performance as well as the scientific performance. We probably need to put some effort into making sure the low-level components are optimized to the point where we won't draw the wrong conclusions about performance from different high-level configurations. Good test datasets are extremely important here, with some sort of truth catalog an important ingredient: we probably want to run on both PhoSim data and real ground-based data (ideally HSC or DECam) in a small field that has significantly deeper HST coverage.
We should spend some time looking into how we might evaluate the significance of a peak, post-detection; if we can do this efficiently, it makes the low-threshold/cull approach much more attractive. On the other side, we should put at least a little effort into seeing how a threshold-on-linear-combination algorithm would look, in terms of operations per pixel and memory usage.
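For the post-detection significance question, one cheap option is to evaluate the matched filter only at the peak position rather than over the whole image; a hypothetical sketch (edge handling omitted, names ours):

    import numpy as np

    def peak_significance(image, variance, psf, x, y):
        """PSF-weighted flux over its noise at a single position (y, x);
        assumes the PSF-sized stamp lies fully inside the image."""
        h, w = psf.shape
        y0, x0 = y - h // 2, x - w // 2
        stamp = image[y0:y0 + h, x0:x0 + w]
        var = variance[y0:y0 + h, x0:x0 + w]
        return np.sum(psf * stamp) / np.sqrt(np.sum(psf ** 2 * var))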
We'll need to pass along a lot more information with peaks, and we have a prototype for doing this using afw::table on the HSC side. Even once that's ported over, there will be more work to be done in determining what the fields should be.
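Without presuming the eventual afw::table schema, the kind of per-peak record we have in mind might carry fields like the following (all hypothetical):

    from dataclasses import dataclass

    @dataclass
    class PeakRecord:
        # Illustrative only; the real schema will live in afw::table and
        # the actual field list is still to be determined.
        x: float              # peak position on its detection image
        y: float
        significance: float   # matched-filter S/N at the peak
        detection_layer: str  # which coadd or difference image it came from
        band: str             # band (or band combination) of that layer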
The position of Deep Detection in the overall processing flow is moderately secure; we'll need to build several coadds in advance, then process them (at least mostly) one at a time, while probably saving the Footprints and peaks to temporary storage. We may build some additional coadds on-the-fly from the pre-built coadds (and it's unlikely we'll need to write any of these to temporary storage).
Skills Required
Mostly experiment and analysis work, with significant high-level Python coding and a bit of low-level C++ coding (which could be done by separate developers). Algorithms work involves using empirically-motivated heuristics to extend a statistically rigorous method to regimes where it's not formally valid, and it's at some level stuff that's all been done before for previous surveys - once we substitute likelihood coadds for single-visit likelihood maps, there's nothing special about multi-epoch processing here. It's the fact that we want to do a better job that drives the algorithm development here.
There's no sense scheduling any real development here until likelihood coadds and high-quality test data are available, and the fact that the algorithmic options here all fit within the same general processing flow means that changes here won't be that disruptive to the rest of the pipeline, so it's not a top priority from a planning standpoint. But we will want it to be relatively mature by the time we get to end-game development on the (harder) association, deblending, and measurement tasks, so we can feed experiments on those algorithms with future-proof inputs.
In Peak Association, we combine all of the detection outputs that cover a patch of sky, including not just Deep Detection outputs but peaks and Footprints that correspond to transients and moving objects, as obtained from Image Differencing. Each set of Footprints and peaks from a different origin - which we'll call "detection layers" - represents a different (and conflicting) view of the objects present in that patch of sky. Footprints from different layers that overlap can probably be straightforwardly merged, which can of course result in Footprints that were separate in some layers being combined in the end. Within each merged Footprint, the algorithm must also associate peaks from different detection images that correspond to the same object. This is much harder, as a peak in one layer may correspond to multiple peaks in another layer.
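The Footprint-merging half is essentially connected components over the "shares pixels" relation. A self-contained union-find sketch, with footprints represented as sets of pixel coordinates (the representation and names are ours, not the pipeline's):

    def merge_footprints(footprints):
        """Merge footprints (sets of (y, x) pixel tuples) from all detection
        layers whenever they share pixels, directly or transitively."""
        parent = list(range(len(footprints)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i

        owner = {}  # pixel -> first footprint seen containing that pixel
        for i, fp in enumerate(footprints):
            for pix in fp:
                if pix in owner:
                    parent[find(i)] = find(owner[pix])  # union on shared pixel
                else:
                    owner[pix] = i
        merged = {}
        for i, fp in enumerate(footprints):
            merged.setdefault(find(i), set()).update(fp)
        return list(merged.values())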
Our baseline plan (derived from the SDSS approach) involves an algorithm that does this without access to any actual pixel data; our hope is to pass enough information in the peaks themselves from detection to allow this stage to proceed entirely at the catalog level (though these catalogs will include Footprints, so they do contain some pixel-like information).
While generic algorithms for spatial matching may be useful for determining sets of overlapping Footprints, the work of merging their peaks is essentially a heuristic string of decisions based on thresholds and empirical testing, based on our understanding of the origins of the peaks and the properties of the peaks themselves. As such, it is unlikely we will be able to provide rigorous guarantees about its performance, aside from what we can get by running the algorithm on test datasets with associated truth catalogs.
While we have a very crude prototype on the HSC fork, this stage is unimplemented on the LSST side, and our best choice is probably to start from scratch, using the SDSS implementation as a guide.
We'll probably use the same test datasets as in Deep Detection, and probably run a lot of experiments for the two stages jointly. Defining metrics for performance will be an important early step, especially once we get beyond fixing the egregious failures, and have to start making tradeoffs that can improve behavior for some blends while making others worse.
We can probably write a preliminary version that will handle ~80% of objects (isolated or simple blends) correctly; the last 20% (or 19.9%, or whatever we do get correct in the end!) will take 80% of the effort.
It is unlikely we will get much use out of the existing Source Association code or Footprint merge code.
We should make some effort to make Peak Association consistent across patch boundaries.
A major algorithmic question is whether to allow any conflicts to remain to be resolved by the deblender and/or measurement; if we view the output of Peak Association as a hypothesis on the number of objects in a blend (and their rough positions) to be evaluated by the deblender, it may be valuable to allow multiple hypotheses to be passed to the deblender. We already intend to do this, in a small way: we intend to measure each "parent" (everything in a footprint) as a single source, in addition to measuring the "child" sources individually. It may make sense to allow this at multiple levels, though clearly this has performance implications for later stages of the pipeline, and we'd want to reject as many hypotheses as possible early. This can be seen as a way to ensure the full Peak Association algorithm does not require pixel data; if we run into problems that do require pixel data, we'll punt them off to later stages of the pipeline.
This will be almost entirely C++ work, with a lot of trial-and-error experimentation involved.
Most work should happen after we've made significant progress on Deep Detection, but we probably want to sketch out high-level interfaces and put together some sort of placeholder quite early. That placeholder could probably take us quite far in terms of keeping lack of effort here from blocking work elsewhere.
In single-frame deblending, we attempt to allocate flux from each pixel to all of the sources that have contributed to it, which can later be used to measure these objects individually. In deep deblending, we need to extend this procedure to multiple epochs and multiple bands (possibly using coadds), while extending the notion of separating "per-pixel" fluxes into whatever we need to separate objects for measurements. We start with the consistent set of peaks and footprints output by Peak Association; this gives us the complete set of sources (assuming, for now, that Peak Association produces only a single hypothesis, as discussed above).
Here's a proposal for the baseline algorithm - essentially the same as used in SDSS, but operating on per-band direct coadds: (TODO: add mathematical definitions)
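As a placeholder until that TODO is filled in properly, the core of the SDSS-style flux apportionment can be written as follows (our notation, hedged): given a nonnegative template T_i(p) for each child i within a footprint (e.g. constructed by symmetry), the image flux I(p) in pixel p is divided as

    \[ f_i(p) = I(p)\,\frac{T_i(p)}{\sum_j T_j(p)}, \qquad \sum_i f_i(p) = I(p), \]

so flux is conserved by construction; in the baseline, each per-band direct coadd would be apportioned this way.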
These variations range from straightforward extensions of the baseline to very open-ended explorations of different ideas, and it is by no means exhaustive.
In addition to attempting to apportion pixel flux between sources, we also plan to fit multiple sources simultaneously (in Deep Measurement). And, of course, those fits can also be used to reapportion fluxes. However, we strongly believe there will always be a need for flux apportioning:
It is also possible that a flux apportioning method followed by individual model fitting may outperform simultaneous model fits, if the models used in the fitting are relatively simplistic but the flux apportioning templates are more flexible.
We have a single-epoch deblender that implements the symmetry-based templates on a single epoch and apportions flux into HeavyFootprints using them. This code needs some refactoring to allow its components to be reused more easily for many of the algorithmic experiments we need to do.
We also have a prototype for a multi-band deblender that has most of the features of the baseline plan, on the HSC fork of the codebase, which will be ported to the LSST side of the codebase in the very near future. This is not expected to survive to production, even if the baseline approach turns out to be very effective, as the data flow doesn't support some features of the baseline that are expected to be critical, particularly the requirement that PSFs are identified consistently in all bands.
The production data flow for deblending is completely undetermined right now, with a major open question being the data axis for parallelization:
It may be that my concerns about memory are essentially unfounded, but we need to do some number-crunching to figure that out.
There are many algorithmic possibilities that need to be implemented and investigated. There is a huge amount of work to do here.
We don't have a good metric for how good a deblender is, or good test datasets on which to evaluate them. We'll probably have to rely a lot on human inspection of real data, along with some use of simulations and requiring that we obtain sensible color-color diagrams from deblended fluxes.
Everything: we'll need to write lots of qualitatively new C++ code, design a big complex system (probably mostly in Python?), possibly with non-trivial parallelization, explore and expand on completely new algorithms, and do a ton of experimentation.
We need to start work on this early, and work on it essentially continuously after that. The first stages need to be:
Once we have a sense of the data flow, we can put together a very high-level interface that should support any of the algorithmic options we want to try, which we can then write as plugins. We'll then probably want to implement multiple plugins before we can test them fully - because truly rigorous tests will depend on having very mature versions of other pipelines - while ruling out as many as we can using simpler tests.