Infrastructure meetings take place every other Thursday at 9:00 Pacific on the BlueJeans infrastructure-meeting channel: https://bluejeans.com/383721668

Date

Goals

    • Alignment of NCSA-provided services with program needs
    • Ensure effective use of the current NCSA infrastructure
    • Refinement and continuous improvement of services, resources and processes
    • Plan for near- and medium-term activities


Discussion items

Topic | Who | Notes
Review of last meeting notes
  • Any updates

Status updates

  • See below
  • Open tickets & outstanding issues

Ramping up support levels (time permitting) | Paul Domagala
  • Project needs
    • Need to involve science teams, developers
    • Need to build an "experts list" & call trees
  • Support level & schedules
    • During observing, expect 24x7 on-call coverage for anything observing-critical
  • Timeframe to implement
  • Tools



PDAC Status | Gregory Dubois-Felsmann
  • Need kickoff meeting for AA, every
  • Firefly load balancer security assessment - IHS-388
  • Cross-system "early integration" meeting at NCSA, Week of Oct. 9, need:
    • ops support
    • Work with Tony Johnson to deploy a Puppetized server
Topics for next meeting



Status of LSST Infrastructure Projects

Item | Who | Notes
Disaster Recovery verification | Andrew Loftus
  • Done
NCSA 3003 Refresh | Bill Glick
  • Done
PROJECTS

Chile AA Deployment
  • Temporary setup in NCSA 3003 'SET' rack A18
  • Waiting on...
    • Arista switch
  • Kay Avila started setting up a pfSense appliance
  • Need:

    • Network info, IP allocations

28 Aug 2017

L1 Cluster (40+ nodes)

28 Aug 2017

  • Doug Fein: order by Sept. 30, pending project office sign-off
L1 Orchestration Test Stand
  • Waiting for hardware
  • This is also in Outstanding Orders. Does it need to be here also?
  • What is the difference between "L1 Cluster" and "L1 Test Stand"?
    • Orchestration might deploy prompt processing payload
    • L1 Complete Test Stand = DAQ + all the messaging & forwarding software
Outstanding Orders | Michelle Butler
  • Qserv-master - IHS-378
    • Michelle Butler status?
    • Paul Domagala needs to get this unstuck
    • UPDATE: It seems that I have unfairly blamed AURA when, in fact, the order is still stuck at NCSA. Silly of me to assume that when I was told a couple weeks ago that the order would go out by the end of the week, that it actually would. I'm working on getting it unstuck.
  • Deployment test nodes (4 Lenovo, 2 Dell (1 chassis))
  • lsst-db (replacement for current host) (Dell R740)
    • Pending project finalization
  • lsst-dbdev (replacements for systems in 3003)
    • Systems retired
  • L1 Orchestration Test Stand


Container Management
  • https://github.com/lsst/LDM-564/tree/tickets/DM-11468
  • Technology
    • Docker + Kubernetes?
    • Other?
  • Need timelines and priorities
  • Potential Use Cases
    • SUIT
    • JupyterHub
    • Qserv & DAC services
    • verification
    • alert distribution systems
    • SQuaSH L1/L2 QC
    • developer services (docs, Jenkins, ...)
    • Possible
      • Bulk data distribution
      • many, many micro-services (e.g. monitoring, pointing prediction service, TBD)
    • Do we need to revisit object stores? The infra team (and others) believe so.
Provisioning

Goal: combine the hardware provisioning system for NCSA 3003 with the one in NPCF.

Possible technologies:

  • OS Deployment
    • Foreman
    • xCAT
  • Software Package Repository Management
    • Katello
    • Pakrat
Oracle Testing | Michelle Butler
Puppet Baseline for LSST project-wide
Nebula Monitoring | Harathi Korrapati
  • Working on integrating the Nebula instance with monitor01 (some issues encountered; work in progress)
  • Monitoring sites:
Cluster Monitoring | Harathi Korrapati
  • Can we add in GPFS? – Added some GPFS metrics to some of the nodes and am testing them in Nagios; will apply them to all nodes this week.

    • Capacities (per fileset)

    • Inode status (per fileset)

  • Working with Paul Domagala on PagerDuty material
    • Going through the PagerDuty start guide and preparing questions for the introductory meeting with PagerDuty engineers
  • Backup of metrics?
    • Currently backed up to local ZFS volume
    • Andrew Loftus: Need a way to get a copy somewhere else (e.g., GPFS, CrashPlan)
      • CrashPlan backups are always encrypted
      • Need to create and manage an encryption key
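
For reference, the per-fileset GPFS capacity and inode checks listed under Cluster Monitoring could look roughly like the sketch below. It is only an illustration, not the checks actually deployed: the `FilesetUsage` record, the `check_fileset` function, and the 80%/90% thresholds are all assumptions, and how the usage numbers are collected from GPFS is left out. The only convention taken as given is the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from dataclasses import dataclass

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


@dataclass
class FilesetUsage:
    """Hypothetical record of usage numbers for one GPFS fileset."""
    name: str
    used_bytes: int
    quota_bytes: int
    used_inodes: int
    max_inodes: int


def percent(used: int, limit: int) -> float:
    """Return usage as a percentage; treat a zero limit as 0% used."""
    return 100.0 * used / limit if limit > 0 else 0.0


def check_fileset(fs: FilesetUsage, warn: float = 80.0, crit: float = 90.0):
    """Return (exit_code, message) based on the worse of capacity and inode usage.

    The warn/crit thresholds are assumed defaults, not agreed values.
    """
    capacity = percent(fs.used_bytes, fs.quota_bytes)
    inodes = percent(fs.used_inodes, fs.max_inodes)
    worst = max(capacity, inodes)
    if worst >= crit:
        status = CRITICAL
    elif worst >= warn:
        status = WARNING
    else:
        status = OK
    message = (f"{LABELS[status]} - {fs.name}: "
               f"capacity {capacity:.1f}%, inodes {inodes:.1f}%")
    return status, message
```

A Nagios wrapper would print the message and exit with the returned code. Reporting the worse of the two metrics keeps it to one service check per fileset; splitting capacity and inodes into separate checks would be an equally reasonable choice.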

Action items

Please enter action items in the form

Responsible Person, Due Date, Description