Running RC2 via Gen3 + Pegasus + Oracle  

This run is done not as a production operator*, but as a special developer who shares a repo with other developers. The aim is not to deliver a polished, efficient service, but to deliver a starting point for study and further work. Besides the presentation, the aim is to have something usable and maintainable for the following 3 months or so while the next set of work is ongoing.

Deliverables

  • Results of running Gen3 DRP pipeline on RC2 HSC data on lsst-dev

    • Using Oracle as backend for registry of shared butler repository

    • Pegasus for workflow

  • Hsin-Fang can run Gen3 as part of the monthly HSC-RC2 reprocessing runs 

    • Incorporate into procedures with lower expectations than Gen2 (failure rate, usefulness of outputs, operability, etc)

  • Instructions for friendly-user developers using Gen3 Butler with Oracle

    • Including accessing weekly ci_hsc repo and RC outputs

    • Assuming running in Pegasus would still be too rough for most friendly-user developers

  • Maintainable Registry code that does not require updating a copy of large portions of the code for each RDBMS for every code change

Why These Deliverables

  1. Larger scale than ci_hsc

  2. Oracle admins can start seeing data flow and more easily provide feedback

  3. Increases visibility of Gen3 results

  4. Gen2 RC running is one of the blockers to deprecating Gen2

  5. Unblocks multi-registry work required for separation of production data from user data

  6. Provides insight for the Batch Production Service design doc

Delivery Date

DMLT: June 4-6, 2019

To do  (Mid-level description to enable low level work and effort generation)

  1. Design how to work in Oracle for the demo and the 3+ months following. Oracle schemas:
    1. Distinct use cases will use distinct schemas (Registry developer, Pipetask developer, RC data user). 
      1. RC runs - Oracle admins owned and maintained. 
      2. Weekly ci_hsc data loaded by Oracle admins into shared schema. 
      3. More frequent schema updates - User owned and maintained. 
        1. Need an easy way to update a schema with the latest changes or start from scratch (Gen3). Schema evolution (schema migration scripts?) (Gen3)
        2. Oracle Server setup (NCSA) 
          1. Recoverability - shared registries increase risk - Nightly exports of every schema will be scheduled with 2 week retention until data volumes warrant a different approach. 
          2. Availability - Standard maintenance windows and support during business hours. 
          3. Authentication - Oracle wallets (initially created by db admins). 
        3. Install software on lsst-dev:  Oracle client software + cx_Oracle (NCSA) 
  2. Increased testing:
    1. More tests in PipelineTasks, ctrl_mpexec, etc  (Jenkins + sqlite)
    2. Running ci_hsc with Gen3 (sqlite) in Jenkins
    3. Manual running of daf_butler tests against Oracle (pre-existing schema)
    4. Stretch goals:
      1. Jenkins running daf_butler tests against Oracle (full setup and teardown of schema)
      2. Running ci_hsc with Gen3 (oracle) in Jenkins
  3. Have different output DataStores for different users (details TBD) 
  4. daf_butler refactoring work to decrease additional changes needed to function with multiple RDBMS products (Gen3). 
  5. Oracle specific Butler changes (NCSA + Gen3, blocked by refactoring work) 
  6. Need RC2 initial repo (Gen3) 
    1. (preferred) Ingest raw executable (+ script to make it easier to start from scratch) (Gen3). Calib files may be ingest + script to set ranges.
    2. Or conversion from Gen2 HSC-RC2 reprocessing runs (like we do with ci_hsc) (ChrisW). Set initial WCS (only explicit update, not select best).
  7.  More Pipeline Tasks to convert to Gen3  (DRP) 
    1. SkyCorrection (needs to be broken up into smaller tasks)
    2. JointCal (cannot be run on ci_hsc data set, needs more data)
  8. Change template to have unique filenames for RC runs 
    1. Hopefully just saving the templates to a file.   (NCSA) 
    2. Unknown if particular values in templates would require any Butler changes (Gen3)
  9. Batch Production Service - NCSA
    1. Assuming still using Andy’s pipetask as the activator 
    2. Need execution config (in particular cpu/memory requirements)
    3. Changes to allocNodes to set up HTCondor pool with partitionable slots
    4. Helpful status/monitoring scripts TBD 
    5. Note the following are blocked by Gen3 development and are not part of this deliverable:
      1. Must always start from beginning of submission (no retries or restarts) 
      2. Must be shared repo model (no job scratch, no Pegasus file transfer) 
  10. RC2 dataset challenges
    1. Single frame processing failures should not halt running
      1. current proposed solution: config option to always write files 
    2. Missing warp file should not halt running
      1. Ran into this with ci_hsc - config option exists to always write files
  11. ci_hsc/RC2 output usable from NCSA LSP 
    1. Oracle software accessible from NCSA LSP (NCSA + LSP/SQRE) 
    2. Not supporting Pegasus submissions from LSP for this milestone
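
For item 8 (unique filenames for RC runs), one way to check saved templates is to confirm that each one contains a placeholder that differs between runs. The sketch below is illustrative only and is not the filename template checking script listed in the timeline. It assumes templates use Python format-style placeholders such as {run}; the field names and dataset types in the usage note are hypothetical.

```python
from string import Formatter


def template_fields(template):
    """Return the set of placeholder names used in a format-style template."""
    fields = set()
    for _literal, field_name, _spec, _conversion in Formatter().parse(template):
        if field_name:
            # Strip attribute or index access such as "visit.id" or "ids[0]".
            fields.add(field_name.split(".")[0].split("[")[0])
    return fields


def check_templates(templates, required):
    """Map each dataset type to the required fields its template lacks.

    templates: dict of dataset type name -> template string (assumed layout).
    required:  field names that must appear for filenames to be unique per run.
    """
    problems = {}
    for dataset_type, template in templates.items():
        missing = sorted(set(required) - template_fields(template))
        if missing:
            problems[dataset_type] = missing
    return problems
```

For example, with the hypothetical templates {"calexp": "{run}/{datasetType}/{visit}/{detector}", "deepCoadd": "{datasetType}/{tract}/{patch}"} and required fields {"run"}, check_templates reports that only deepCoadd lacks the run field.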

* Why the note about not being Production?

  • Missing separation of production data from user data (requires user write access to production schema)

  • The outputs of a production pipeline should not be directly written to the production Data Backbone (or central database in general) to allow the Batch Production Service to:

    • Minimize database connections

    • Use various methods for retries and restarts

  • Many missing Batch Production Service production features, some of which are blocked by not-yet-implemented Gen3 features.

Current lsst-dev Oracle Instructions

  • For this milestone, no attempts will be made to make Oracle a part of the lsst_stack.
  • Oracle instantclient and cx_Oracle are currently installed on lsst-dev in /project/production/oracle.
  • Example Oracle environment settings are in /project/production/oracle/oracle_env-v1.sh. Sourcing it should not affect the environment set up by the LSST stack.
  • Untar the Oracle wallet tar file given to you by the admin into a directory under your home directory.
  • The admin will also have given you a net service name for the wallet (e.g., gen3_cred_yourlogin_1). If you were given the whole tnsnames.ora file, the net service name is the top/outer-most key.
  • You will also need sqlnet.ora and tnsnames.ora files (make sure the path in sqlnet.ora points to the wallet files).
  • Set environment variable TNS_ADMIN to point to the directory where the *.ora files live.
  • Test connection:
    • sqlplus:
      sqlplus /@<net service name>
      select user from dual;
      quit;
    • Use the test Python program, which prints the user you connected to Oracle as if the connection succeeds (Note: python3 must be in your path. If you haven't already, source /software/lsstsw/stack/loadLSST.bash):
      /project/production/oracle/test_conn.py <net service name>
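
Before testing the connection, it can help to confirm that the TNS_ADMIN directory is set up as described above. The following is a minimal sketch using only the Python standard library. It is not the test_conn.py program, and the function names are hypothetical. It assumes tnsnames.ora follows the usual "alias = (DESCRIPTION = ...)" layout and that net service names compare case-insensitively.

```python
import os
import re
from pathlib import Path


def tns_aliases(text):
    """Return the top-level (outermost) net service names in tnsnames.ora text."""
    aliases = []
    depth = 0  # current parenthesis nesting; aliases only appear at depth 0
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if depth == 0:
            match = re.match(r"([\w.\-]+(?:\s*,\s*[\w.\-]+)*)\s*=", stripped)
            if match:
                aliases.extend(a.strip() for a in match.group(1).split(","))
        depth += stripped.count("(") - stripped.count(")")
    return aliases


def check_tns_admin(net_service_name, tns_admin=None):
    """Return a list of problems found in the TNS_ADMIN directory (empty if none)."""
    tns_admin = tns_admin or os.environ.get("TNS_ADMIN")
    if not tns_admin:
        return ["TNS_ADMIN is not set"]
    problems = []
    directory = Path(tns_admin)
    for name in ("sqlnet.ora", "tnsnames.ora"):
        if not (directory / name).is_file():
            problems.append(f"missing {name} in {directory}")
    tnsnames = directory / "tnsnames.ora"
    if tnsnames.is_file():
        # Assumption: aliases are matched case-insensitively.
        known = [alias.lower() for alias in tns_aliases(tnsnames.read_text())]
        if net_service_name.lower() not in known:
            problems.append(f"{net_service_name} not found in tnsnames.ora")
    return problems
```

Calling check_tns_admin("gen3_cred_yourlogin_1") returns an empty list when TNS_ADMIN is set, both *.ora files exist, and the name is an outermost key in tnsnames.ora. It does not verify the wallet path in sqlnet.ora or attempt a database connection.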

Timeline

Milestones by Thursday Gen3 meeting date, covering ci_hsc/RC2 Running, NCSA - BPS, NCSA - Oracle, Gen3, and DRP:

2/21/2019
  • Completed ci_hsc gen2 run (sqlite to load into Oracle), ci_hsc gen3 run (sqlite3) in pegasus to provide feedback if things are no longer working.
  • Jim Bosch Oracle account
  • Must provide updated weekly Gen3 science configs prior to NCSA run

2/28/2019
  • Completed ci_hsc gen2 run (sqlite to load into Oracle), ci_hsc gen3 run (sqlite3) in pegasus to provide feedback if things are no longer working.
  • Completed: Init Oracle accounts+wallets (Nate - 03/01), nightly DB backups (03/04), weekly ingest of ci_hsc, install Oracle client and cx_Oracle on lsst-dev (03/01)

3/7/2019
  • Completed ci_hsc gen2 run (sqlite to load into Oracle) (DM-18176, DM-18336), ci_hsc gen3 run (sqlite3) in pegasus to provide feedback if things are no longer working (DM-18177).
  • Completed Filename template checking script (DM-18181)

3/14/2019
  • Completed ci_hsc gen2 run (sqlite to load into Oracle) (DM-18176, DM-18336), ci_hsc gen3 run (sqlite3) in pegasus to provide feedback if things are no longer working (DM-18178).
  • Decision about how to support multiple RDBMSs. Completed code changes for sqlite side. Code ready to start making Oracle changes.

3/21/2019
  • Completed ci_hsc gen2 run (sqlite to load into Oracle) (DM-18176, DM-18336), ci_hsc gen3 run (sqlite3) in pegasus to provide feedback if things are no longer working (DM-18179).
  • Completed BPS v0.1 exec config, allocateNodes (partitionable slots) (DM-18351)
  • Completed unique filename templates ci_hsc (where only requires config file change) (DM-18356)
  • Completed: Easy way to initialize dev butler schema in Oracle

3/28/2019
  • Completed ci_hsc gen2 run (sqlite to load into Oracle) (DM-18176, DM-18336), ci_hsc gen3 run (sqlite3) in pegasus to provide feedback if things are no longer working (DM-18180).

4/4/2019
  • Completed ci_hsc gen2 run (sqlite to load into Oracle) (DM-18782), ci_hsc gen3 run (sqlite3) in pegasus to provide feedback if things are no longer working (DM-18783).
  • Completed: Oracle Butler works (no efficiency checks, just doesn't abort). Selecting Oracle schema (DM-18629). Table and view names case-insensitive on DB side (DM-17023).

4/11/2019
  • Completed ci_hsc gen2 run (sqlite to load into Oracle) (DM-18782), ci_hsc gen3 run (sqlite) in pegasus to provide feedback if things are no longer working (DM-18825).
  • Completed BPS v0.1 status/history scripts (DM-18780)
  • Completed: Scripts to initialize ci_hsc repo for a Gen3 run without latest weekly Gen2 outputs.

4/18/2019
  • Completed ci_hsc gen2 run (sqlite to load into Oracle) (DM-18782), ci_hsc gen3 run (Oracle) in pegasus to provide feedback if things are no longer working (DM-18830).
  • Completed: Mechanisms to create RC2 init repo

4/25/2019
  • Completed ci_hsc gen2 run (sqlite to load into Oracle) (DM-18782), ci_hsc gen3 run (Oracle) in pegasus to provide feedback if things are no longer working (DM-18831).
  • Completed: RC2 init repo avail in Oracle (DM-18829)
  • Complete RC2 DRP pipeline includes always write output config options where needed.
  • Freeze: features, API, schema

5/2/2019
  • Start running RC2 and reporting problems

5/9/2019

5/16/2019

5/23/2019

5/30/2019
  • Completed: Can access Oracle Registry + GPFS DataStore from NCSA LSP

6/06/2019
  • Milestone completed. Presentation during DMLT meeting June 04-06. Includes instructions, any software installs, etc.

