...

Every science user should have MySQL credentials.

...

Relevant Off-the-shelf Systems

 

iRODS

 

  • http://irods.org/
  • scalability

    • today's large-scale installations: O(100) million files, O(100) simultaneous users
    • a federated iCAT system is used to distribute the metadata load for very large installations
  • useful features: automatically (dynamically) applying rules and triggers, enforcing policies
    • for example: generate and store checksum, verify checksum on read, automatically group (tar) small files before storing in MSS
  • APIs
    • command line shell
    • Java, PHP, Python
  • metadata
    • options for catalog backend: PostgreSQL, Oracle, MySQL
    • based on AVU (attribute-value-unit) triplets (this might impair performance; structured metadata would be faster)
  • a few observations from Andy Salnikov, based on experience at LCLS:
    • building iRODS is a complicated, mostly interactive process, though I managed to wrap it into an RPM script
    • the client APIs for iRODS look cumbersome, especially the C API
    • building the Python wrappers (which are poorly supported third-party code) needs some non-trivial steps, such as rebuilding the iRODS libraries with -fPIC
    • iRODS works best if all data is accessed through iRODS only, without direct file-system-level access to the data
    • iRODS only knows one checksum type (md5, I believe, which is a bit CPU-intensive)
    • for some things we had to modify the iCAT database directly
    • the MySQL backend may not be well supported; Postgres is their standard option (I wrote the MySQL backend myself, and it is a bit messy)
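
To illustrate the AVU remark in the metadata item above, here is a minimal sketch comparing a generic attribute-value-unit table with a structured table, using an in-memory SQLite database. This is not the actual iCAT schema; the table names (`avu`, `structured`) and the attributes (`visit`, `filter`) are hypothetical and chosen only for illustration. It shows that a query on two attributes needs one self-join per attribute in the AVU layout, while the structured layout answers the same question with a single-table filter.

```python
import sqlite3

# Hypothetical schemas for illustration only; NOT the real iCAT tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE avu (object_id INTEGER, attribute TEXT, value TEXT, unit TEXT);
CREATE TABLE structured (object_id INTEGER PRIMARY KEY, visit INTEGER, filter TEXT);
""")

# Three made-up files with two metadata attributes each.
files = [(1, 100, "r"), (2, 100, "g"), (3, 101, "r")]
for object_id, visit, filt in files:
    conn.execute("INSERT INTO structured VALUES (?, ?, ?)", (object_id, visit, filt))
    conn.executemany(
        "INSERT INTO avu VALUES (?, ?, ?, ?)",
        [(object_id, "visit", str(visit), ""), (object_id, "filter", filt, "")],
    )

# AVU layout: every additional attribute in the predicate adds a self-join,
# and values are compared as text.
avu_rows = conn.execute("""
    SELECT a1.object_id
    FROM avu AS a1
    JOIN avu AS a2 ON a1.object_id = a2.object_id
    WHERE a1.attribute = 'visit'  AND a1.value = '100'
      AND a2.attribute = 'filter' AND a2.value = 'r'
""").fetchall()

# Structured layout: the same question is a plain filter on typed columns.
structured_rows = conn.execute(
    "SELECT object_id FROM structured WHERE visit = 100 AND filter = 'r'"
).fetchall()

print(avu_rows, structured_rows)
```

Both queries select the same file; the difference is in how the query grows as more attributes are added, which is one plausible reason structured metadata could be faster.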
 

Fermi Data Catalog

 

  • https://github.com/brianv0/datacat-doc/blob/master/LSST-Datacat-overview.md
  • http://docs.datacatalog.apiary.io (REST API; partly out of date: there is no “children” resource anymore, the blurb about 302/303 codes is misleading, and field names have changed slightly. Will be revisited shortly)

  • catalog for image data, loosely coupled with the data

  • originally developed for the Fermi Gamma-ray Space Telescope, used in production since 2007
  • key features: metadata (blended mix of structured and unstructured), crawlers with pluggable project-specific components, REST API, efficient searching, ACL handling
  • manages 23 million files for Fermi
  • spatial optimizations: currently a separate application built on top of the catalog, but could be done inside the dataCat if needed
  • backend: originally Oracle, now supports MySQL (porting almost finished, undergoing final tests)
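
The idea of crawlers with pluggable project-specific components, listed under key features above, can be sketched roughly as follows. This is only an illustration of the general pattern, not the Data Catalog's actual code or API: the `Crawler` class, its `register` and `crawl` methods, and the `fits_extractor` function are all hypothetical names, and the metadata values are invented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A project-specific component: takes a file path, returns metadata for it.
Extractor = Callable[[str], Dict[str, object]]


@dataclass
class Crawler:
    """Hypothetical crawler that dispatches files to pluggable extractors."""

    extractors: Dict[str, Extractor] = field(default_factory=dict)

    def register(self, suffix: str, extractor: Extractor) -> None:
        # Plug in a project-specific extractor for one file suffix.
        self.extractors[suffix] = extractor

    def crawl(self, paths: List[str]) -> List[Dict[str, object]]:
        # Build one catalog record per path; files without a matching
        # extractor are still recorded, with empty metadata.
        records = []
        for path in paths:
            suffix = path.rsplit(".", 1)[-1] if "." in path else ""
            extractor = self.extractors.get(suffix)
            metadata = extractor(path) if extractor else {}
            records.append({"path": path, "metadata": metadata})
        return records


def fits_extractor(path: str) -> Dict[str, object]:
    # Placeholder: a real component would open the file and read its headers.
    return {"format": "fits", "name": path.rsplit("/", 1)[-1]}


crawler = Crawler()
crawler.register("fits", fits_extractor)
records = crawler.crawl(["/data/run1/img001.fits", "/data/run1/notes.txt"])
print(records)
```

The point of the pattern is that the crawler core stays generic while each project supplies only its own extraction logic.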

...