Data Access Hangout 2015-01-26

Date

26 Jan 2015

API versions - should we have "current" version?

difficult to maintain
based on best practices from others - they don't do that
so, "no". We can easily add it later if we decide to
version will change only when there is a change breaking the API
extending API can be handled without breaking the API
- unless we have to roll back a new change, and new clients have already been deployed
API won't change much, so no need to deal with major, minor etc.; a single number should suffice
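
A minimal sketch of the single-number versioning decided above, assuming URIs of the form "/<service>/v<N>/..." as in the examples elsewhere in these notes. The function name, the regular expression and the set of supported versions are illustrative assumptions, not an agreed implementation. Note that there is deliberately no "current" alias.

```python
import re

# Assumption: only "v0" exists so far; extend this set on a breaking change.
SUPPORTED_VERSIONS = {0}

# Assumption: paths look like /<service>/v<N>[/<rest>], e.g. /db/v0/query
_VERSION_RE = re.compile(r"^/(?P<service>[a-z]+)/v(?P<version>\d+)(?P<rest>/.*)?$")


def split_version(path):
    """Split '/db/v0/query' into ('db', 0, '/query')."""
    match = _VERSION_RE.match(path)
    if match is None:
        # Covers aliases such as /db/current/query, which are not supported.
        raise ValueError("path must look like /<service>/v<N>/...: %r" % path)
    version = int(match.group("version"))
    if version not in SUPPORTED_VERSIONS:
        raise LookupError("unsupported API version: v%d" % version)
    return match.group("service"), version, match.group("rest") or ""
```

With this shape, adding a "current" alias later would be a small change to the regular expression rather than a redesign.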

paging - should we break results into pages?

we don't really need "paging"; the real motivation is to minimize damage if someone accidentally requests a huge amount of data (e.g. all images ever produced by lsst), which will be easy to do through the API
don't call it "pageSize", call it "maxResult" / "firstResult"
set by default to something relatively large
tricky for dynamic data (e.g., for continuously changing L1)
it will be OK if it works the same way as MySQL paging, i.e., we don't have to try super hard to keep results stable as we page through them.
- sorting results might help keep the results stable (e.g., sorting that ensures new data is added at the end)
definitely document all this
add an indicator to the API that lets users determine whether a result is stable or not
potentially HTTP chunked responses might help to lessen server load, but this is harder for json/xml
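
A rough sketch of the "maxResult" / "firstResult" idea, assuming results are plain dictionaries held in memory and sorted on an "id" column so that new rows land at the end. The default value, the "truncated" field and the function name are assumptions made for illustration; the notes only say the default should be "relatively large".

```python
# Assumption: the real default is undecided, 10000 is only a placeholder.
DEFAULT_MAX_RESULT = 10000


def apply_window(rows, first_result=0, max_result=DEFAULT_MAX_RESULT, sort_key="id"):
    """Return at most max_result rows, starting at offset first_result."""
    if first_result < 0 or max_result < 0:
        raise ValueError("firstResult and maxResult must be non-negative")
    # Sorting on a monotonically growing key keeps earlier windows stable
    # when new data is appended, much like ORDER BY with LIMIT/OFFSET.
    ordered = sorted(rows, key=lambda row: row[sort_key])
    window = ordered[first_result:first_result + max_result]
    return {
        "firstResult": first_result,
        "maxResult": max_result,
        # Hypothetical field: tells the client that more rows exist.
        "truncated": first_result + len(window) < len(ordered),
        "results": window,
    }
```

For example, apply_window(rows, first_result=1, max_result=2) on five rows returns the second and third rows by id and reports truncated as True.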

image type - should image types be part of URI?

we are talking about coadds, raw, calexp, etc. here, not formats (e.g. not fits, jpeg)
don't call it image "type". Proposed name: call it image "kind"
note that image kind is part of primary key, together with the id.
- e.g., two different kinds of images might end up having the same id, but they will have nothing in common
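
To illustrate why the kind has to be part of the key, here is a small sketch with an in-memory catalog. The class name, the catalog contents and the lookup function are hypothetical; only the "kind + id" pairing comes from the discussion.

```python
from typing import NamedTuple


class ImageKey(NamedTuple):
    """Composite key: the id alone is not unique across image kinds."""
    kind: str
    image_id: int


# Hypothetical catalog: the same id appears under two unrelated kinds.
catalog = {
    ImageKey("coadd", 12345): "coadd image 12345",
    ImageKey("raw", 12345): "raw image 12345",
}


def lookup(kind, image_id):
    """Fetch an entry by (kind, id); raises KeyError if it does not exist."""
    return catalog[ImageKey(kind, image_id)]
```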

multiple images

lsst pipelines will always produce images with image plane, mask and variance, all 3 together in one physical fits file
some will argue "never give data without mask" etc, but we should optimize performance, network traffic etc, and deliver only what user really wants
decision: by default, deliver entire image with all 3 planes, but allow selecting individual planes
use the commonly used REST notation ";", as Gregory documented in the comments of DM-1694 (so our M12 from API "GET /meta/v0/image/coadd;mask/12345" would become "GET /meta/v0/image/coadd/12345;plane=data;mask")
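
A possible way to parse the ";" notation on the last path segment, written as a sketch. The plane names ("data", "mask", "variance") and the rule that a bare token such as ";mask" adds another plane are assumptions read off the example URI above; the exact grammar was not settled in this meeting.

```python
# Assumption: these are the three plane names; only "data" and "mask"
# appear in the example URI.
ALL_PLANES = ("data", "mask", "variance")


def parse_segment(segment):
    """Split '12345;plane=data;mask' into ('12345', [('plane', 'data'), ('mask', None)])."""
    head, *params = segment.split(";")
    parsed = []
    for param in params:
        if not param:
            continue  # tolerate a trailing or doubled ';'
        name, sep, value = param.partition("=")
        parsed.append((name, value if sep else None))
    return head, parsed


def select_planes(params):
    """Return the requested planes, or all three when none are requested."""
    chosen = []
    for name, value in params:
        if name == "plane" and value is not None:
            candidate = value   # 'plane=data'
        elif value is None:
            candidate = name    # bare ';mask' (assumed to add a plane)
        else:
            continue            # unrelated parameter, ignored here
        if candidate not in ALL_PLANES:
            raise ValueError("unknown plane: %r" % candidate)
        if candidate not in chosen:
            chosen.append(candidate)
    return tuple(chosen) if chosen else ALL_PLANES
```

A segment with no parameters falls back to all three planes, matching the decision to deliver the entire image by default.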

image ids: "/image/123" vs "/image?id=123"

cutout - is it separate resource or not?

1. retrieving an existing image or part of such image (and the image already has an id etc), if we are using original image coordinate system - there is no need to create a new resource
2. cutout that involves complex operations (stitching etc), or rotating, or transforming coordinates etc. Here we need to produce a new resource

the first case will be rare, mostly for internal debugging etc. Selection criteria will be very limited; a simple "ra, dec + height/length in arcsec" should cover most cases
note that raw images will be in random rotations, so in most cases we will want to at least rotate them, which already puts us in "case 2"
note, I6 from API page needs to be rewritten
to limit (full vs cutout), one can use ";" notation:
- GET /image/v0/coadd/12345;full
- GET /image/v0/coadd/12345;cutout
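
The two cases above could be told apart with a check like the following. This is only a sketch: the request class, its field names and the three boolean flags are hypothetical stand-ins for whatever the final cutout parameters turn out to be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CutoutRequest:
    """Hypothetical cutout parameters: 'ra, dec + height/length in arcsec'."""
    ra: float
    dec: float
    height_arcsec: float
    length_arcsec: float
    rotate: bool = False            # assumed flag
    stitch: bool = False            # assumed flag
    transform_coords: bool = False  # assumed flag


def needs_new_resource(request):
    """Case 2 (new resource) as soon as any complex operation is involved."""
    return request.rotate or request.stitch or request.transform_coords
```

A plain sub-region in the original image coordinate system stays in case 1, while any rotation, stitching or coordinate transformation moves the request to case 2.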

metadata for columns returned by "GET /db/v0/query"

things like units, types, null/not null etc
in most cases, we want to avoid extra call, so we should deliver metadata with query results
so, send it with the data. How fancy we get is format dependent, e.g.
- in csv, people typically put column names in the first line as a comment
- in json, use key-value pairs and specify everything: names, types, etc.
btw, we will need to implement a dedicated call to get just metadata too
and we will need something like "GET /db/v0/query/explain" to get estimated query time, # chunks involved etc.
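
A sketch of how the same result could carry its column metadata in json and in csv. The column names, types, units and the layout of the json envelope ("metadata" / "results") are invented for illustration and are not a proposed schema.

```python
import csv
import io
import json

# Hypothetical columns and rows, for illustration only.
COLUMNS = [
    {"name": "objectId", "type": "BIGINT", "unit": None, "nullable": False},
    {"name": "ra", "type": "DOUBLE", "unit": "deg", "nullable": False},
    {"name": "flux", "type": "DOUBLE", "unit": "nJy", "nullable": True},
]
ROWS = [(1, 10.5, 3.2), (2, 11.0, None)]


def to_json(columns, rows):
    """json can carry everything: names, types, units, nullability."""
    return json.dumps({
        "metadata": {"columns": columns},
        "results": [list(row) for row in rows],
    })


def to_csv(columns, rows):
    """csv carries only column names, as a comment on the first line."""
    buf = io.StringIO()
    buf.write("# " + ",".join(col["name"] for col in columns) + "\n")
    # The csv module writes None as an empty field.
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

The csv variant loses types and units, which is one reason a dedicated metadata-only call is still needed.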

result format for db queries?

Topics not covered are being moved to next week's hangout, see Data Access Hangout 2015-02-02