...
Every science user should have MySQL credentials.
...
Relevant Off-the-shelf Systems
iRODS
- http://irods.org/
- scalability
- today's large-scale installations: O(100) million files, O(100) simultaneous users
- federated iCAT system used to distribute the metadata load for very large installations
- useful features: automatically (dynamically) applying rules and triggers, enforcing policies
- for example: generate and store checksum, verify checksum on read, automatically group (tar) small files before storing in MSS
- APIs
- command line shell
- Java, PHP, Python
- metadata
- options for catalog backend: PostgreSQL, Oracle, MySQL
- based on AVU (attribute-value-unit) triplets (this might impair performance; structured metadata would be faster)
- a few observations from Andy Salnikov, based on experience at LCLS:
- building iRODS is a complicated, mostly interactive process, though I managed to wrap it into an RPM script
- client APIs for iRODS look cumbersome, especially the C API
- building the Python wrappers (which are poorly supported third-party code) needs some non-trivial steps, like rebuilding the iRODS libraries with -fPIC
- iRODS works best if all data is accessed through iRODS only, without direct file-system level access to data
- iRODS only knows one checksum type (MD5, I believe, which is a bit CPU-intensive)
- for some things we had to mess with ICAT database directly
- the MySQL backend may not be well supported; Postgres is their standard option (I wrote the MySQL backend myself, and it is a bit messy)
Fermi Data Catalog
- https://github.com/brianv0/datacat-doc/blob/master/LSST-Datacat-overview.md
- http://docs.datacatalog.apiary.io (REST API docs, partly out of date: there is no “children” resource anymore, the blurb about 302/303 codes is misleading, and field names have changed slightly; will be revisited shortly)
- catalog for image data, loosely coupled with the data
- originally developed for the Fermi Gamma-ray Space Telescope, used in production since 2007
- key features: metadata (a blend of structured and unstructured), crawlers with pluggable project-specific components, REST API, efficient searching, ACL handling
- managing 23 million files for Fermi
- spatial optimizations: currently a separate application built on top of the catalog, but could be done inside the dataCat if needed
- backend: originally Oracle, now supports MySQL (porting almost finished, undergoing final tests)
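
To make the "blend of structured and unstructured metadata" idea concrete, here is a minimal sketch using SQLite from the Python standard library. It is not the dataCat schema or API: the table names, the columns (`path`, `run`, `band`) and the `register`/`search` helpers are all hypothetical. Fixed columns hold the fields every dataset has, and a key/value side table holds free-form, project-specific fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Structured part: fixed, indexable columns every dataset has.
    CREATE TABLE dataset (
        id   INTEGER PRIMARY KEY,
        path TEXT NOT NULL,
        run  INTEGER NOT NULL,
        band TEXT NOT NULL
    );
    CREATE INDEX dataset_run ON dataset (run);
    -- Unstructured part: free-form name/value pairs per dataset.
    CREATE TABLE dataset_meta (
        dataset_id INTEGER NOT NULL REFERENCES dataset (id),
        meta_name  TEXT NOT NULL,
        meta_value TEXT NOT NULL
    );
    CREATE INDEX dataset_meta_nv ON dataset_meta (meta_name, meta_value);
""")


def register(path, run, band, **extra):
    """Insert one dataset plus any free-form metadata; return its id."""
    cur = conn.execute(
        "INSERT INTO dataset (path, run, band) VALUES (?, ?, ?)",
        (path, run, band),
    )
    dataset_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO dataset_meta VALUES (?, ?, ?)",
        [(dataset_id, name, str(value)) for name, value in extra.items()],
    )
    return dataset_id


def search(min_run, meta_name, meta_value):
    """Combine a structured range filter with a free-form metadata match."""
    rows = conn.execute(
        "SELECT d.path FROM dataset d "
        "JOIN dataset_meta m ON m.dataset_id = d.id "
        "WHERE d.run >= ? AND m.meta_name = ? AND m.meta_value = ? "
        "ORDER BY d.path",
        (min_run, meta_name, meta_value),
    )
    return [row[0] for row in rows]
```

For example, after `register("/data/r200/b.fits", 200, "r", quality="good")`, a call such as `search(150, "quality", "good")` filters on the typed `run` column and on the free-form `quality` field in a single query. Frequently searched fields can live in typed, indexed columns, while rarely used or project-specific ones stay in the side table.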
...