Date:   

Attendees: Unknown User (pdomagala), Margaret Gelman, Donald Petravick, Joel Plutchak

NOTE: statistics cover the past 2 weeks

Scheduled Maintenance

See the LSST Service Status Page

  • Note that the December maintenance window is moved from 12/21 to 12/14.
    • Nebula will not be affected. The L1 test stand would have had OS upgrades early Thursday morning, but I've asked the admins to hold off until January in order to provide maximum stability during the early integration exercise.
  • 2017-12-22 to 2018-01-01 (inclusive) is a University holiday period. Services will be operational. NCSA will respond to incidents based on business criticality. During the holidays, incidents should be submitted as JIRA IHS tickets, as we will not be monitoring Slack channels as much as normal.


As-is Services

Incidents

  • 1 created, 1 resolved.

Created & Resolved:

Discussion of Notable Issues

      • Unexpected reboot of lsst-qserv-db16 (IHS-606).

        Firmware upgrades were successful and all nodes were returned to service Thur. 11/30.  Since then, there have been no unplanned reboots.

Requests

    • 4 created, 3 resolved

Change Management

This process primarily targets requests that can be handled with current level of effort (LOE) resources.  This process is also designed to detect and redirect items to the EVMS process if they exceed LOE resources.

Changes proceed through the following stages:

  1. Initial Assessment: Check that the submitter has stated a plausible business case and that the relevant T/CAM agrees.
  2. Feasibility Assessment: Is the change well formulated, does it address a project need, and is it cost-effective?
  3. Planning: A detailed implementation plan is created which takes into account impacts, resource needs, testing and verification.
  4. Implementation: The plan is executed to implement the change.
  5. Assessment: Verification of successful change and issues analysis.
  6. Closed: Document and formally close out the request.



Open Change Requests


| Key | Summary | Process Stage† | Reporter | Priority | Created | Status |
| IHS-576 | Configure slurm to accept jobs to use only partial nodes | Planning | Tim Morton | Major | 02/Nov/17 | This change has been approved and is tentatively scheduled for early CY18. |
| IHS-580 | DM developers need a build/test environment that supports docker containers | Planning | Joshua Hoblitt | Minor | 02/Nov/17 | Use case and requirements are being gathered. Unknown User (pdomagala) will discuss this with the PDAC working group this Thursday. |
| IHS-612 | Implement debug and normal queues for developers on the verification cluster | Planning | Yusra AlSayyad | Major | 16/Nov/17 | Currently being planned and tested. Tentatively scheduled for deployment before 22/Dec/17. |
| IHS-638 | Install a tex distro on lsst-dev | Closed | Merlin Fisher-Levine | Minor | 04/Dec/17 | Approved by M. Butler and completed on 08/Dec/17. |

Heard on the Street This Week, but no Ticket Filed

  • New

    • It was suggested that per-user storage usage for each shared fileset be made available.  Preferably readable by any DM member.

  • Previous

    • Several users expressed a desire to have the Intel compiler suite (icc) available on lsst-dev.
    • Increase the ssh idle session timeout, which is currently 1 hr. (John Parejko via Slack)
    • Suggestion to deploy Kubernetes on PDAC; it is assumed that this is being handled through the rolling-wave (EVMS) process.
    • Tools for parallel programming in the batch computing environment (GNU Parallel and others).

Change Process Notes

  • The change process is being refined based on experience and feedback from exercising it over the past month.

Problem Management

Report format under development

Interactions

  1. T/CAM interactions

    1. Protracted discussion with John Swinbank, Kian-Tat Lim, and Simon Krughoff re. the need for items purged from the GPFS /scratch partition to remain retrievable for some short period of time. See IHS-613.

  1. ITSC

    1. Unknown User (jmatt) proposes in ITRFC-10 that the ITSC consider making recommendations for container and container orchestration best practices.  It's unclear if 

      Next meeting  


  1. PDAC

    1. Last PDAC meeting  

    2. Major topics of discussion were  need to finish this

      Next PDAC meeting   



  2. Summit-base Tiger Team

    1. Suspended until after the first of the year since Jeff is in Chile.  However, I’m linked in to Chile IT.


  3. Infrastructure

Next meeting 
    1. The meeting was focused on the upcoming maintenance event.

Other business

Proposed that we install a standard Influx/Telegraf/Prometheus stack on the standard Nebula images, and install a monitoring system in OpenStack to serve up the data/dashboards.

Action Items

New

From last week