Date

Attendees

...

Related page

  • Web Data Access

...

Discussion items

API versions - should we have "current" version?

...

  • difficult to maintain
  • based on best practices from others - they don't do that
  • so, "no". We can easily add it later if we decide to
  • version will change only when there is a change breaking the API
  • extending API can be handled without breaking the API
    • unless we have to roll back a new change, and new clients have already been deployed
  • API won't change much, so no need to deal with major, minor etc, single number should suffice
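A minimal sketch of what the single-number version scheme implies for request paths such as "/db/v0/query". The function name and the regular expression are illustrative assumptions, not something decided in the meeting.

```python
import re

# Assumed path shape: /<service>/v<N>/<rest>, with one integer version
# and no "current" alias, following the decision above.
_PATH_RE = re.compile(r"^/(?P<service>[a-z]+)/v(?P<version>\d+)(?P<rest>/.*)?$")


def split_versioned_path(path):
    """Return (service, version, rest); raise ValueError if no version is present."""
    match = _PATH_RE.match(path)
    if match is None:
        raise ValueError("not a versioned API path: %r" % path)
    rest = match.group("rest") or "/"
    return match.group("service"), int(match.group("version")), rest


print(split_versioned_path("/db/v0/query"))
```

Under this sketch a path like "/db/current/query" is rejected, which matches the "no 'current' version" outcome.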

paging - should we break results into pages?

  • we don't really need "paging"; the real motivation is to limit the damage if someone accidentally requests a huge amount of data (e.g. all images ever produced by LSST), which will be easy to do through the API
  • don't call it "pageSize", call it "maxResult" / "firstResult"
  • set by default to something relatively large
  • tricky for dynamic data (e.g., for continuously changing L1)
  • it will be OK if it works the same way as MySQL paging, i.e., we don't have to try super hard to keep results stable as we page through them
    • sorting results might help keep them stable (e.g., a sort order that ensures new data is added at the end)
  • definitely document all this
  • add an indicator to the API that lets users determine whether a result is stable or not
  • HTTP chunked responses might help lessen server load, but this is harder for JSON/XML
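The "firstResult" / "maxResult" idea can be sketched as a simple window over the result rows. The default value, the function name, and the rule "a result counts as stable only if it was sorted" are assumptions made for illustration; the notes only say the default should be relatively large and that some stability indicator should exist.

```python
# Assumed default; the notes only ask for "something relatively large".
DEFAULT_MAX_RESULT = 100000


def limit_results(rows, first_result=0, max_result=DEFAULT_MAX_RESULT, sort_key=None):
    """Return (selected_rows, stable).

    stable is True only when a sort key was supplied, on the assumption
    that a deterministic order keeps windows consistent between calls.
    """
    if first_result < 0 or max_result < 0:
        raise ValueError("first_result and max_result must be non-negative")
    stable = sort_key is not None
    ordered = sorted(rows, key=sort_key) if stable else list(rows)
    return ordered[first_result:first_result + max_result], stable


rows = [3, 1, 2, 5, 4]
print(limit_results(rows, first_result=1, max_result=2, sort_key=lambda r: r))
print(limit_results(rows, first_result=1, max_result=2))
```

A real service would push the offset and limit into the database query rather than slicing in memory; the slice here only shows the semantics.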

image type - should image types be part of URI?

...

  • how about "GET /db/v0/query/explain"? It could return estimated query time, # chunks involved etc.

...

  • how about "GET /db/v0/query?type=async"? It could return "queryId=<uniqueQueryId>, location=<url of the result>"
  • "GET /db/v0/query/status?queryId=<qId>" to get the status of an async query

  • we are talking about coadds, raw, calexp, etc. here, not formats (e.g., not FITS, JPEG)
  • don't call it image "type". Proposed name: call it image "kind"
  • note that image kind is part of primary key, together with the id.
    • e.g., two different kinds of images might end up having the same id, but they will have nothing in common
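The point that image kind and id together form the primary key can be shown with a small lookup keyed on both values. The record contents and names below are made-up placeholders.

```python
from collections import namedtuple

# Composite key: the id alone is not unique across image kinds.
ImageKey = namedtuple("ImageKey", ["kind", "image_id"])

# Hypothetical catalog entries; the same id appears under two kinds.
catalog = {
    ImageKey("raw", 12345): "placeholder record for a raw image",
    ImageKey("coadd", 12345): "placeholder record for a coadd",
}


def lookup(kind, image_id):
    """Fetch a record by (kind, id); raises KeyError if the pair is unknown."""
    return catalog[ImageKey(kind, image_id)]


print(lookup("raw", 12345))
print(lookup("coadd", 12345))
```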

multiple images

  • LSST pipelines will always produce images with image plane, mask and variance, all 3 together in one physical FITS file
  • some will argue "never give data without mask" etc, but we should optimize performance, network traffic etc, and deliver only what user really wants
  • decision: by default, deliver entire image with all 3 planes, but allow selecting individual planes 
  • use commonly used REST notation ";", as Gregory documented in comments of DM-1694 (so our M12 from API "GET /meta/v0/image/coadd;mask/12345" would become "GET /meta/v0/image/coadd/12345;plane=data;mask")
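A rough sketch of how a server might split a path segment that uses the ";" notation. The function name is hypothetical, and how a bare token such as "mask" should be interpreted is not settled in these notes, so the sketch simply reports it with a value of None.

```python
def parse_segment(segment):
    """Split 'value;name=val;flag' into (value, [(name, val_or_None), ...])."""
    parts = segment.split(";")
    params = []
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing or doubled ';'
        name, sep, value = part.partition("=")
        # A bare token (no '=') is kept with value None; its meaning is left open.
        params.append((name, value if sep else None))
    return parts[0], params


print(parse_segment("12345;plane=data;mask"))
print(parse_segment("12345"))
```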

image ids: "/image/123" vs "/image?id=123"

  • the former

cutout - is it separate resource or not?

  • depends. Two cases here:
    1. retrieving an existing image or part of such an image (the image already has an id etc). If we are using the original image coordinate system, there is no need to create a new resource
    2. a cutout that involves complex operations (stitching etc), rotating, transforming coordinates etc. Here we need to produce a new resource
  • the first case will be rare, for internal debugging etc. Selection criteria will be very limited, a simple "ra, dec + height/length in arcsec" should cover most cases
  • note that raw images will be in random rotations, so in most cases we will want at minimum to rotate them, which already puts us in "case 2"
  • note, I6 from API page needs to be rewritten
  • to limit (full vs cutout), one can use ";" notation: 
    • GET /image/v0/coadd/12345;full

    • GET /image/v0/coadd/12345;cutout

metadata for columns returned by "GET /db/v0/query"

  • things like units, types, null/not null etc
  • in most cases, we want to avoid extra call, so we should deliver metadata with query results
  • so, send with data. How fancy we get is format dependent, eg
    • in csv, typically people specify column names in first line as comment
    • in json, key-value pairs, specify everything: names and types etc
  • btw, we will need to implement a dedicated call to get just metadata too
  • and we will need something like "GET /db/v0/query/explain" to get estimated query time, # chunks involved etc.
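A small sketch of delivering column metadata together with the data in two formats, as described above: a commented first line for CSV, and full key-value metadata for JSON. The column names, types, units, and the JSON layout are all assumptions for illustration.

```python
import json

# Hypothetical column descriptions and rows.
columns = [
    {"name": "ra", "type": "double", "unit": "deg", "nullable": False},
    {"name": "decl", "type": "double", "unit": "deg", "nullable": False},
    {"name": "flux", "type": "float", "unit": "nmgy", "nullable": True},
]
rows = [[10.5, -3.2, 1.5], [10.6, -3.1, None]]


def to_csv(columns, rows):
    """CSV: column names go in a commented first line; nulls become empty fields."""
    header = "# " + ",".join(col["name"] for col in columns)
    lines = [",".join("" if v is None else str(v) for v in row) for row in rows]
    return "\n".join([header] + lines)


def to_json(columns, rows):
    """JSON: full metadata (names, types, units, nullability) next to the data."""
    return json.dumps({"metadata": {"columns": columns}, "data": rows}, sort_keys=True)


print(to_csv(columns, rows))
print(to_json(columns, rows))
```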

result format for db queries?

  • for now use IPAC table format
    • ipac team will provide the spec

Topics not covered are being moved to next week's hangout, see Data Access Hangout 2015-02-02

 

Terminology

...

In IPAC archives, dataset always refers to a collection of data. 

  • For WISE, each data release is a dataset: like preliminary release, allsky release, AllWISE release, ...

  • For Spitzer, each legacy or exploration team produces a set of data, which is released under the team name or the program name

  • Sometimes a dataset includes both catalogs and images; sometimes a dataset contains only catalogs or only images.

  • Inside a dataset, the data are further separated into smaller units, sometimes by type (catalog, image, spectrum).

  • Images can be further separated by level of data products (level1-single frame image, level2-coadded image), and by wavelength (WISE has 4 bands).

  • For each image, there are also ancillary data associated with it, like artifacts, uncertainty, coverage, PSF file, mask, ...

...