Overview

This page attempts to capture at a high level the software and algorithm development necessary to implement the processing of objects detected at the full survey depth (at the time of a particular data release), including the detection, deblending, and measurement of sources too faint to be detected in any individual visit.  The algorithms to be used here are generally poorly understood; we have many options for extending well-understood algorithms for processing single-epoch data to multi-epoch data, and considerable research is needed to find the right balance between computational and scientific performance in doing so.  Unfortunately, different algorithmic options may require vastly different parallelization and data flow, so we cannot yet make assertions about even the high-level interfaces and structure of the code.  We do, however, have a good understanding of most of the needed low-level algorithms, so our goal should be to implement these as reusable components that will allow us to quickly explore different algorithmic options.  This will also require early access to parallelization interfaces, test data, and analysis tools that will be developed outside the DRP algorithms team.

Inputs

* We don't need Image Differencing outputs to start the Deep Processing (e.g. we can probably do Image Coaddition first), and there may be some value to doing the DRP Image Differencing at the same time as some parts of the Deep Processing (Deep Background Modeling, in particular).

Stages/Components

The next few sections contain the various components of the deep processing, in rough order - the exact flow is very much TBD.  In fact, the lines drawn between several components are also somewhat arbitrary; the distinctions between detection, peak association, deblending, and measurement are based on a baseline algorithm that is derived largely from the SDSS processing, which was considerably simplified by only needing to process a single epoch for every band.  Algorithms significantly different from this baseline would define these steps and the interfaces between them differently.  Assuming that baseline, or something close to it:

Image Coaddition

We'll almost certainly need some sort of coadded image to detect faint sources, and do at least preliminary deblending and measurement.  We'll use at least most of the same code to generate templates for Image Differencing.

Because there's no single coadd that best represents all the data it summarizes, there are several different kinds of coadds we may want to produce.  It's best to think of coaddition as a major piece of the algorithms below, but one which will be executed up front and reused by different stages of the processing; while there will be algorithmic research involved in determining which kinds of coadds we build, that research effort will be covered under the detection, deblending, and measurement sections, below.

All coadds will be produced by roughly the following steps, sketched in code below:

  1. Warp to the coadd coordinate system
  2. Convolution (coadd type dependent)
  3. Scale fluxes to coadd photometric system
  4. Stacking
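
A purely illustrative sketch of these four steps, using numpy/scipy stand-ins rather than the actual pipeline tasks (warp_to_coadd, stack, and the mapping/kernel/weight inputs are all hypothetical):

```python
# Illustrative only: numpy/scipy stand-ins for the four coaddition steps.
import numpy as np
from scipy.ndimage import convolve, map_coordinates

def warp_to_coadd(image, pixel_mapping):
    """Step 1: resample an exposure onto the coadd grid.  `pixel_mapping` holds,
    for each coadd pixel, the corresponding (y, x) position in the input exposure
    (the exposure-to-coadd WCS transform evaluated on the coadd grid)."""
    return map_coordinates(image, pixel_mapping, order=3, mode="constant", cval=np.nan)

def stack(warped_images, weights):
    """Step 4: weighted mean stack, ignoring NaN (missing or rejected) pixels.
    Pixels with no valid inputs come out as NaN."""
    cube = np.array(warped_images)
    w = np.array(weights)[:, None, None] * np.isfinite(cube)
    return (w * np.nan_to_num(cube)).sum(axis=0) / w.sum(axis=0)

def make_coadd(exposures, mappings, kernels, flux_scales, weights):
    warped = []
    for image, mapping, kernel, scale in zip(exposures, mappings, kernels, flux_scales):
        im = warp_to_coadd(image, mapping)   # 1. warp to the coadd coordinate system
        if kernel is not None:               # 2. convolution (coadd-type dependent, e.g. PSF-matching)
            im = convolve(im, kernel)
        warped.append(im * scale)            # 3. scale fluxes to the coadd photometric system
    return stack(warped, weights)            # 4. stacking
```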

The overall processing flow for all coadds will likely be the same (though the final processing flow may or may not match our current prototype), and will share most of the same code.

Types of Coadds

For any of these coadd types (direct, PSF-matched, or likelihood coadds), we may also want to coadd across bands, either to produce a χ2 coadd or one weighted to match a given source spectrum.  It should be possible to create any multi-band coadd by adding single-band coadds; we do not anticipate having to go back to individual exposures to create multi-band coadds.
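
As a minimal sketch of that point, assuming each single-band coadd comes with a per-pixel variance plane (the function names and the particular weightings are illustrative choices, not settled designs):

```python
# Illustrative multi-band combinations built purely from single-band coadds.
import numpy as np

def chi_squared_coadd(band_images, band_variances):
    """Chi^2-style combination: the sum of the squared per-band significances."""
    return sum((im / np.sqrt(var)) ** 2
               for im, var in zip(band_images, band_variances))

def sed_weighted_coadd(band_images, band_variances, sed_fluxes):
    """Inverse-variance combination weighted to match an assumed source spectrum;
    `sed_fluxes` gives the expected relative flux of the source in each band."""
    numerator = sum(f * im / var
                    for im, var, f in zip(band_images, band_variances, sed_fluxes))
    denominator = sum(f ** 2 / var
                      for var, f in zip(band_variances, sed_fluxes))
    return numerator / denominator
```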

Input Images

We will also need different coadds for different sets of input images.  As discussed above, including images with broad PSFs in direct and PSF-matched coadds degrades the image quality of the coadd even as it adds depth, so we will likely want to produce coadds of these types that are optimized for image quality in addition to those that are optimized for depth.  We will also want to build coadds that correspond only to certain ranges in time, in order to detect (and possibly deblend and measure) moving sources with intermediate speeds (i.e. offsets that are a small but nonnegligible fraction of the PSF size between epochs) that are below the single-frame detection limit.
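
As a hypothetical illustration of this kind of input selection (VisitInfo, the 0.7 arcsec seeing cut, and the MJD window are all placeholders for whatever criteria we eventually automate):

```python
# Hypothetical input-image selection for image-quality-optimized and
# time-window coadds; field names and cut values are placeholders.
from dataclasses import dataclass

@dataclass
class VisitInfo:
    visit_id: int
    seeing_fwhm: float  # delivered PSF FWHM in arcsec
    mjd: float          # observation date

def select_best_seeing(visits, max_fwhm=0.7):
    """Inputs for a coadd optimized for image quality rather than depth."""
    return [v for v in visits if v.seeing_fwhm <= max_fwhm]

def select_time_window(visits, mjd_start, mjd_end):
    """Inputs for a short-timescale coadd aimed at intermediate-speed movers."""
    return [v for v in visits if mjd_start <= v.mjd < mjd_end]
```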

Status/Challenges

We have a relatively mature pipeline for direct coadds that should only require minor modification to produce at least crude PSF-matched coadds.  We need to finish (fix/reimplement) the PSF-matched coadds, and do some testing to verify that they're working properly.  We haven't put any effort at all into likelihood coadds yet, and we need to completely reimplement our approach to χ2 and other multi-band coadds.

Our propagation of mask pixels to coadds needs to be reviewed and probably reconsidered in some respects.

Our stacker class is clumsy and inflexible, and needs to be rewritten.  This is a prerequisite for really cleaning up the handling of masks.

We do not yet have a settled policy on how to handle transient, variable, or moving objects in coadds.  Downstream processing would probably be simplified if transient and moving objects that can be detected at the single-epoch limit using Image Differencing were excluded entirely from the coadds, but some types of coadds do not handle rejecting masked pixels gracefully (at least in theory; this level of masking may work acceptably in practice), and we may want to adopt some other policy if this proves difficult.  Similarly, it would probably be convenient if coadds were defined such that variable sources have their mean flux on the coadd, but this may not be worthwhile if the implementation proves difficult.

The coadd datasets are a bit of a mess, and need a redesign.

The data flow and parallelization are currently split into two separate tasks to allow for different parallelization axes, and the stacking step is parallelized at a coarser level than would be desirable for rapid development (though it may be acceptable for production).  Background matching may also play a role in setting the data flow and parallelization for coaddition, as it shares some of the processing.

Choices about which images to include in a coadd are still largely human-directed, and need to be automated fully.

We have a flexible system for specifying coordinate systems, but we have not yet done a detailed exploration of which projection(s) we should use in production, or determined the best scales for the "tract" and "patch" levels.  We have no plans yet for how to deal with tract-level overlaps, though it is likely this will occur at the catalog level, rather than in the images.

We have put no effort so far into analyzing and mitigating the effects of correlated noise, which will become more important when we regularly deal with more than just direct coadds (but may be important even for these).  A major question here is how well we can propagate the noise covariance matrix; exactly propagating all uncertainty information is not computationally feasible, but there are several proposed approximation methods that are likely to be sufficient.
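
To make the issue concrete, here is a small sketch of the kind of compact summary we might propagate instead of the full covariance matrix, assuming white noise in the input pixels and a fixed convolution kernel (warping behaves analogously, as a locally varying kernel); this is not existing stack code:

```python
# For a linear operation y = A x on pixels with independent variance v,
# Cov(y) = A diag(v) A^T.  For a stationary convolution this reduces to the
# kernel autocorrelation scaled by the input variance: a compact, approximate
# summary of the correlated noise in a warped/convolved image or coadd.
from scipy.signal import correlate2d

def noise_autocorrelation(kernel, pixel_variance=1.0):
    """Covariance of the convolved noise as a function of pixel lag."""
    return pixel_variance * correlate2d(kernel, kernel, mode="full")
```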

Dependencies

Skills Required

The algorithms here are relatively well-understood, at least formally, and most of the research work is either in coming up with the appropriate shortcuts to take (e.g. in propagating noise covariance, handling masks) or is covered below in detection/deblending/measurement.  Almost all of the coding we need to do for coaddition is pure Python, with the only real exception being the stacker (unless we need to make drastic changes to the convolution or warping code to improve computational performance or deal with noise covariances).

Scheduling

PSF-matched coadds are a requirement for any serious look into options for estimating colors.  The measurement codes themselves need some additional testing before this becomes a blocker, however.

Likelihood coadds should be implemented as-needed when researching deep detection algorithms.

Dataset and task cleanup should be done as soon as the new butler is available.

The stacker rewrite and mask-propagation fixes should be done before we spend too much time investigating measurement algorithm behavior on coadds; starting with sane pixel masks will make our lives much easier there.

Noise covariance work may need to come early in the detection algorithm research, but if not, it's a relatively low priority: it may not end up being important unless we decide we need PSF-matched coadds for colors and determine that the measurement algorithms we want to use there are affected by correlated noise.

Deep Background Modeling

Algorithm

Traditional background modeling involves estimating and subtracting the background from individual exposures separately.  While this will still be necessary for visit-level processing, for deep processing we can use a better approach:

  1. PSF-match and warp all but one of the N input exposures (on a patch of sky) to the final, reference exposure, and difference each of them against the reference, using much of the same code we use for Image Differencing.
  2. Model the backgrounds of the N-1 difference images; depending on the quality of the PSF-matching, we can fit the instrumental background without interference from astrophysical backgrounds.
  3. Combine all N original exposures and subtract the N-1 background-difference models, producing a coadd that contains the full-depth astrophysical signal from all exposures but an instrumental background for just the reference exposure.
  4. Model and subtract that final background using traditional methods, taking advantage of the higher signal-to-noise ratio of the sources in the coadd.

We can then also compute an improved background model for any individual exposure as the combination of its difference background relative to the reference and the background model for the reference.
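
A purely illustrative sketch of that sequence, with psf_match_and_warp, fit_background, and estimate_traditional_background as hypothetical stand-ins for the real tasks and a simple mean as the combine step:

```python
# Illustrative only; image arithmetic assumes numpy-like arrays, and the three
# callables are hypothetical stand-ins for the real PSF-matching, background
# fitting, and traditional background estimation tasks.
def background_matched_coadd(reference, others, psf_match_and_warp,
                             fit_background, estimate_traditional_background):
    """`reference` is the chosen reference exposure; `others` are the N-1 remaining
    exposures covering the same patch of sky."""
    matched, diff_backgrounds = [], []
    for exp in others:
        warped = psf_match_and_warp(exp, reference)
        # Astrophysical signal cancels in the difference, leaving (mostly) instrumental background.
        diff_backgrounds.append(fit_background(warped - reference))
        matched.append(warped)

    # Average all N exposures, removing each exposure's background-difference model,
    # so only the reference exposure's instrumental background remains in the coadd.
    n = len(others) + 1
    coadd = (reference + sum(w - b for w, b in zip(matched, diff_backgrounds))) / n

    # Model the remaining (reference) background at coadd signal-to-noise and subtract it.
    reference_background = estimate_traditional_background(coadd)

    # Improved per-exposure backgrounds: difference background plus the reference model.
    improved = [reference_background + b for b in diff_backgrounds]
    return coadd - reference_background, reference_background, improved
```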

Status/Challenges

We have prototype code that works well for SDSS data, but experiments on the HSC side have shown that processing non-drift-scan data is considerably more difficult.  One major challenge is selecting/creating a seamless reference exposure across amplifier, sensor, and visit boundaries, especially in the presence of gain and linearity variations.  We also need to think about how the flat-fielding and photometric calibration algorithms interact with background matching: because the sky has a different color than the sources, it is impossible for a single photometric solution to simultaneously produce both a seamless sky and correct photometry - and we also need to be able to generate the final photometric calibration using measurements that rely only on the cruder visit-level background models.

The problem of generating a seamless reference image across the whole sky is very similar to the problem of building a template image for Image Differencing, and in fact the Image Differencing template or some other previously-built coadd may be a better choice for the reference image, once those coadds become available (this would require a small modification to the algorithm summarized above).  Of course, in this case, the problem of bootstrapping those coadds would still remain.

We also don't yet have a good sense of where background matching belongs in the overall processing flow.  It seems to share many intermediates with either Image Coaddition or Image Differencing, depending on whether the background-difference fitting is done in the coadd coordinate system (most likely; shared work with coaddition) or the original exposure frame (less likely; shared work with Image Differencing).  It is also unclear what spatial scale the modeling needs to be done at, which could affect how we would want to parallelize it.

Dependencies

Skills Required

This is a difficult algorithmic research problem that interacts in subtle ways with ISR, Photometric Self-Calibration, and the data flow and parallelization for Image Coaddition.  It should not be a computational bottleneck on its own, but it will likely need to piggyback on some other processing (e.g. Image Coaddition) to achieve this.

Scheduling

Because we can use traditional background modeling outputs as a placeholder, and the improvement due to background matching is likely to only matter when we're trying to really push the precision of the overall system, we can probably defer the complete implementation of background matching somewhat.  It may be a long research project, so we shouldn't delay too long, though.  We should have an earlier in-depth design period to sketch out possible algorithmic options (and hopefully reject a few) and figure out how it will fit into the overall processing.

Deep Detection

Algorithm

Detection for isolated static sources with known morphology and SED is straightforward: correlate the image with the source model (for a point source, the PSF), threshold the resulting significance map, and find peaks and grow Footprints around the above-threshold pixels.
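
A minimal sketch of that baseline, assuming a point source, a coadd with a per-pixel variance plane, and placeholder threshold/grow parameters (this is not the existing detection task):

```python
# Illustrative matched-filter detection; threshold and grow radius are placeholders.
import numpy as np
from scipy.ndimage import binary_dilation, label
from scipy.signal import fftconvolve

def detect(image, variance, psf, threshold=5.0, grow=2):
    """Correlate with the PSF, threshold the significance map, and return
    labeled, grown Footprint-like regions plus the significance map itself."""
    kernel = psf[::-1, ::-1]                            # correlation via flipped-kernel convolution
    filtered = fftconvolve(image, kernel, mode="same")
    filtered_variance = fftconvolve(variance, kernel ** 2, mode="same")
    significance = filtered / np.sqrt(filtered_variance)
    grown = binary_dilation(significance > threshold, iterations=grow)  # grow Footprints
    footprints, nfootprints = label(grown)
    return footprints, nfootprints, significance
```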

We may need to take some care with noise covariances introduced by warping, and the details of finding peaks and growing Footprints are more heuristic than statistically rigorous.  Nevertheless, the real challenge here is expanding this procedure to a continuous range of morphologies and SEDs.  There is a spectrum of options here; the two extremes are thresholding many different linear combinations of the per-band/per-morphology maps (one for each hypothesis, each at the nominal threshold), and thresholding just one or a few combined maps at a lower threshold and then culling the spurious peaks afterwards.

We'll almost certainly use some mixture of these approaches.  

While most transients and fast-moving objects will be detected via traditional Image Differencing, we'll also want to build coadds that cover only a certain range in time to detect faint long-term transients and faint moving objects.  We can use the same approaches mentioned above on these coadds.

Status/Challenges

Most of the low-level code to run detection already exists, though it may need some minor improvements (e.g. handling noise correlations, improving initial peak centroids, dealing with bad and out-of-bounds pixels).  A notable exception is our lack of support for building likelihood coadds.

We have a large algorithmic landscape to explore for how to put the high-level pieces together, as discussed in the Algorithm section.  This is essentially a lot of scripting and experimenting, with attention to the computational performance as well as the scientific performance.  We probably need to put some effort into making sure the low-level components are optimized to the point where we won't draw the wrong conclusions about performance from different high-level configurations.   Good test datasets are extremely important here, with some sort of truth catalog an important ingredient: we probably want to run on both PhoSim data and real ground-based data (ideally HSC or DECam) in a small field that has significantly deeper HST coverage.

We should spend some time looking into how we might evaluate the significance of a peak, post-detection; if we can do this efficiently, it makes the low-threshold/cull approach much more attractive.  On the other side, we should put at least a little effort into seeing how a threshold-on-linear-combination algorithm would look, in terms of operations per pixel and memory usage.

We'll need to pass along a lot more information with peaks, and we have a prototype for doing this using afw::table on the HSC side.  Even once that's ported over, there will be more work to be done in determining what the fields should be.
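
Purely as an illustration of the kind of per-peak information we might want to carry forward (the actual fields, and the afw::table schema that would hold them, are exactly what remains to be determined):

```python
# Hypothetical per-peak record; every field here is a placeholder, not a decided schema.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PeakRecord:
    x: float                       # peak position on the coadd (pixels)
    y: float
    peak_significance: float       # detection significance at the peak
    band_significance: Dict[str, float] = field(default_factory=dict)  # per-band significances
    detected_on: str = ""          # which coadd (e.g. chi^2, time-window) produced the peak
    merged: bool = False           # whether the peak was matched/merged across coadds
```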

The position of Deep Detection in the overall processing flow is moderately secure; we'll need to build several coadds in advance, then process them (at least mostly) one at a time, while probably saving the Footprints and peaks to temporary storage.  We may build some additional coadds on-the-fly from the pre-built coadds (and it's unlikely we'll need to write any of these to temporary storage).

Dependencies

Skills Required

Mostly experiment and analysis work, with significant high-level Python coding and a bit of low-level C++ coding (which could be done by separate developers).  The algorithms work involves using empirically-motivated heuristics to extend a statistically rigorous method to regimes where it's not formally valid, and at some level it's all been done before for previous surveys - once we substitute likelihood coadds for single-visit likelihood maps, there's nothing special about multi-epoch processing here.  It's the fact that we want to do a better job that drives the algorithm development.

Scheduling

There's no sense scheduling any real development here until likelihood coadds and high-quality test data are available.  Because the algorithmic options all fit within the same general processing flow, changes here won't be very disruptive to the rest of the pipeline, so this isn't a top priority from a planning standpoint.  But we will want it to be relatively mature by the time we get to end-game development on the (harder) association, deblending, and measurement tasks, so we can feed experiments on those algorithms with future-proof inputs.

Peak Association

Deep Deblending

Deep Measurement

Deep Aperture Corrections